🎨 Image Generation
📷 CVPR2026 · 434 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (357) · 💬 ACL2026 (5) · 🧪 ICML2026 (141) · 🤖 AAAI2026 (79) · 🧠 NeurIPS2025 (218) · 📹 ICCV2025 (213)
🔥 Top topics: Diffusion Models ×137 · Text-to-Image ×35 · Alignment/RLHF ×23 · Image Editing ×21 · Layout & Composition ×20
- 2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
-
A fine-tuning framework named 2ndMatch is proposed. By aligning the second-order Jacobian matrix \(J^\top J\) (inspired by Finite-Time Lyapunov Exponents) of the pruned model with the original model, it matches the temporal sensitivity to input perturbations, significantly narrowing the generation quality gap.
- 3D Space as a Scratchpad for Editable Text-to-Image Generation
-
This paper proposes treating an editable 3D scene as a "spatial scratchpad" for text-to-image generation. A suite of LLM agents parses text prompts into subject meshes, plans placements/orientations/cameras in 3D, and renders this layout into an image via identity-preserving depth-conditioned generation. It achieves a 32% training-free improvement in text alignment on GenAI-Bench and supports consistent image updates through simple 3D modifications.
- A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation
-
T2LDM utilizes a "Guidance Network" (SCRG) that provides geometric reconstruction supervision during training but is discarded at inference, along with Directional Positional Encoding (DPE) to correct street distortion from spherical projection. It generates finely structured and controllable LiDAR scenes despite the extreme scarcity of Text-LiDAR pairs, and introduces the controllability benchmark T2nuScenes and the TBR metric.
- A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
-
CoTyle conjures novel and reproducible visual styles using a single numeric code. It accomplishes "one number = one style" for the first time in the open-source community by training a discrete style codebook to compress images into style indices, a T2I diffusion model to generate images conditioned on these indices, and an autoregressive generator to create new style index sequences from scratch.
- A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
-
Addressing the limitation where pose/appearance control signals are injected with fixed intensity across all denoising steps in controllable hand generation, this paper proposes TCCA. It utilizes a set of learnable queries to align heterogeneous features—noisy latents, 3D pose, and appearance—into a unified space to dynamically adjust injection intensity step-by-step. Complementing this is a pose-invariant appearance encoder using SVD orthogonal decomposition to remove pose artifacts. The method outperforms FoundHand across FID/LPIPS/PCK metrics on datasets like InterHand2.6M.
- A Training-Free Style-Personalization via SVD-Based Feature Decomposition
-
Building on the scale-wise autoregressive model Infinity, this work finds that the largest singular value component of the third feature \(F_3\) in the generation process specifically encodes style information. It therefore proposes a training-free approach that injects the style of a reference image at this feature step using SVD (Principal Feature Blending), while stabilizing the structure via attention maps from a content branch (Structural Attention Correction). This achieves style fidelity comparable to fine-tuning methods in 3.58 seconds, up to 195 times faster than those methods.
- Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
-
Addressing the pain points of "sub-linear speedup and quality degradation" in multi-GPU diffusion inference, this paper leverages the inherent "conditional/unconditional dual-path" of Classifier-Free Guidance as the data parallelism splitting dimension (Conditional Partitioning). It then uses a metric for noise discrepancy (rel-MAE) to adaptively determine when to enable pipeline parallelism. On two RTX 3090 GPUs, it achieves 2.31× and 2.07× speedups for SDXL and SD3, respectively, with almost no loss in image quality.
- Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
-
Aiming at zero-shot image-to-image generation methods like IP-Adapter and InstantID that "clone faces or styles with a single image," this paper proposes Adapter Shield. It utilizes a pair of trainable "encryptor/decryptor" modules to map image encoder embeddings into garbled code based on a password. Multi-objective adversarial perturbations are then used to "anchor" the original image to these garbled embeddings. This causes unauthorized users to generate distorted results, while authorized users with the correct password can decrypt the embeddings for normal use—marking the first unified framework in this field to combine "protection" and "authentication."
- Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
-
This paper proposes Adaptive Auxiliary Prompt Blending (AAPB), which derives a closed-form adaptive blending coefficient via Tweedie’s formula to dynamically balance the contributions of auxiliary anchor prompts and target prompts at each denoising step. This training-free approach significantly improves semantic accuracy and structural fidelity in rare concept generation and zero-shot image editing.
- Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
-
The authors propose Spectrum, a global spectral-domain feature forecasting method based on Chebyshev polynomials. By treating the intermediate features of the diffusion model denoiser as functions of time and fitting coefficients via ridge regression, it achieves long-range feature forecasting where errors do not accumulate with step size. Spectrum achieves a \(4.79 \times\) speedup on FLUX.1 and \(4.67 \times\) on Wan2.1-14B with nearly no loss in quality.
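The fitting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single scalar feature, timesteps already rescaled to \([-1, 1]\), and hypothetical helper names (`cheb_basis`, `fit_ridge`, `predict`). It fits Chebyshev coefficients by ridge regression on cached observations and then evaluates the fitted curve at a later timestep.

```python
def cheb_basis(x, degree):
    """Chebyshev polynomials T_0..T_degree at x, via T_k = 2x*T_{k-1} - T_{k-2}."""
    vals = [1.0, x]
    for _ in range(2, degree + 1):
        vals.append(2.0 * x * vals[-1] - vals[-2])
    return vals[: degree + 1]


def solve(a, b):
    """Solve a small dense system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(times, values, degree, lam):
    """Ridge regression in the Chebyshev basis: (Phi^T Phi + lam*I) c = Phi^T y."""
    phi = [cheb_basis(t, degree) for t in times]
    k = degree + 1
    gram = [
        [sum(row[i] * row[j] for row in phi) + (lam if i == j else 0.0) for j in range(k)]
        for i in range(k)
    ]
    rhs = [sum(row[i] * y for row, y in zip(phi, values)) for i in range(k)]
    return solve(gram, rhs)


def predict(coeffs, t):
    """Evaluate the fitted Chebyshev expansion at time t."""
    return sum(c * b for c, b in zip(coeffs, cheb_basis(t, len(coeffs) - 1)))


# Toy feature trajectory (an assumption for illustration only).
times = [-1.0, -0.5, 0.0, 0.5]
values = [3 * t * t - t + 0.5 for t in times]
coeffs = fit_ridge(times, values, degree=2, lam=1e-9)
forecast = predict(coeffs, 1.0)  # a step that was never observed
```

Because the coefficients describe the whole trajectory rather than a local difference, the forecast at a distant step is a single evaluation of the fitted curve instead of a chain of step-by-step extrapolations.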
- Advancing Image Classification with Discrete Diffusion Classification Modeling
-
The authors transform image classification from "one-shot label prediction" into "running a diffusion process in a discrete class label space to approximate the posterior \(P(c\mid y)\)." By predicting a Concrete Score for iterative denoising, the method outperforms equivalent ResNets on ImageNet with only a few diffusion steps, with the performance gap widening as input degradation (low resolution / sparse data) increases.
- AE2VID: Event-based Video Reconstruction via Aperture Modulation
-
Addressing the pain points where event-based video reconstruction fails in static regions and suffers from error accumulation when relying solely on sparse motion events, this paper actively modulates the aperture periodically. This "passively triggers" dense events even in static regions, allowing for the resolution of dense intensity reference maps. A dual-network architecture (AENet for aperture events and MENet for bidirectional fusion of motion events) is then used to reconstruct high-speed HDR video, achieving a 27.4% reduction in MSE compared to the SOTA on EvAid.
- Agentic Retoucher for Text-To-Image Generation
-
Agentic Retoucher reformulates defect repair after T2I generation into a human-like "Perception \(\rightarrow\) Reasoning \(\rightarrow\) Action" closed-loop decision process. Utilizing three collaborative agents for context-aware distortion detection, human-aligned diagnostic reasoning, and adaptive local repair, it improves plausibility by 2.89 points on GenBlemish-27K, with 83.2% of results rated by humans as superior to the original images.
- AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
-
AHS overcomes the limitations of self-supervised training by utilizing a head reenactment model (GAGAvatar) to generate synthetic augmented data. Combined with a dual-encoder attention mechanism and an adaptive mask strategy, it achieves SOTA performance in head swapping tasks for full-body images.
- Align Images Before You Generate
-
The authors discover that the intermediate noisy features of multi-image diffusion models "natively" encode cross-image correspondences. Consequently, they propose CorrAdapter—a training-free, plug-and-play bypass branch that requires no external geometric or semantic priors. It utilizes these native correspondences to align matching regions before the images are fully generated, significantly enhancing spatio-temporal consistency in multi-view and video generation.
- Aligning Multi-Character Narrative Image Generation with Multi-Aspect Human Preferences
-
To address the "semantic misalignment, identity confusion, and visual degradation" issues in multi-character narrative image generation, this paper constructs a fine-grained preference dataset, NI-RLHF, containing textual critiques. It trains an explainable reward model, NIReward, which "generates critiques before scoring," and utilizes it to drive the ADPO preference optimization algorithm. This approach aligns the generative model with human preferences across prompt following, identity consistency, and visual quality dimensions.
- All-in-One Slider for Attribute Manipulation in Diffusion Models
-
The All-in-One Slider framework is proposed, which trains an Attribute Sparse Autoencoder (SAE) on the text embedding space to decouple multiple facial attributes into sparse semantic directions. This enables a single lightweight module to achieve fine-grained continuous control of 52+ attributes, supporting multi-attribute composition and zero-shot manipulation of unseen attributes.
- Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
-
This paper proposes KAB (Keyframe-Anchored Attention Bias) and ReTRo (Rescaled Temporal RoPE), two training-free inference-time methods based on the Wan2.1 video diffusion model, to address semantic infidelity, frame inconsistency, and rhythm instability in generative inbetweening (GI) with large motion under sparse keyframes. It also constructs TGI-Bench, the first text-conditioned GI evaluation benchmark.
- Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling
-
The Ani3DHuman framework is proposed, combining kinematic-driven mesh animation with video diffusion priors. Through Self-guided Stochastic Sampling, low-quality rigid renderings are restored into high-fidelity videos, achieving realistic modeling of non-rigid clothing dynamics.
- Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
-
Ar2Can proposes decomposing multi-human image generation into two stages: spatial planning (Architect) and identity-preserving rendering (Artist). By utilizing GRPO reinforcement learning combined with a spatial-anchored face reward function based on Hungarian matching to train the Artist model, it achieves an identity preservation score of 68.2 and a count accuracy of 90.2 on the MultiHuman-Testbench, significantly surpassing all baselines.
- Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
-
This paper discovers through linear probing that implicit decisions in diffusion models (e.g., defaulting to generating a male when gender is unspecified) are primarily controlled by self-attention layers rather than cross-attention layers. Based on this, the ICM method is proposed, achieving SOTA debiasing effects by intervening only on a few key self-attention layers while minimizing image quality degradation.
- Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
-
APPLE utilizes a pure diffusion teacher-student framework for face swapping by training a precise teacher to generate high-quality pseudo-labels for the student. The teacher employs conditional deblurring (instead of full-face masking) to preserve the target's skin tone, lighting, and pose, while attribute-aware inversion anchors fine-grained attributes (makeup, glasses) into the noise to produce clean pseudo-labels. The student learns exclusively from these clean pseudo-labels, ultimately achieving SOTA in attribute preservation (FFHQ FID 2.18, Pose 1.85) while maintaining competitive ID similarity.
- Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution
-
The attribution of AI-generated images is redefined from a classification paradigm to an instance retrieval paradigm. A model-agnostic framework, LIDA, based on low-bit plane fingerprints is proposed. Through unsupervised pre-training and few-shot attribution adaptation, SOTA performance in Deepfake detection and image attribution is achieved under zero-shot and few-shot settings.
- BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling
-
This paper proposes BeautyGRPO, a reinforcement learning-based face retouching framework. By constructing a fine-grained preference dataset, FRPref-10K, to train a specialized reward model and designing a Dynamic Path Guidance (DPG) mechanism to balance random exploration and high fidelity, the framework achieves natural retouching results aligned with human aesthetic preferences.
- Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
-
This paper proves that "predictive feature caching" methods such as TaylorSeer and FoCa mathematically degenerate into fixed-coefficient linear combinations of historical features. Demonstrating that DiT feature trajectories are inherently highly linearly reconstructible, the authors propose \(L^2P\)—replacing hand-derived fixed coefficients with a set of learnable linear weights for each timestep. Using only 50 images and 20 seconds of training on a single GPU, it accelerates diffusion sampling by 4.5–7.2× on FLUX/Qwen-Image while maintaining significantly higher PSNR than existing methods.
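The observation that extrapolation-style caching reduces to a fixed linear combination can be checked on the simplest case. The sketch below is illustrative only and does not reproduce TaylorSeer, FoCa, or \(L^2P\): it uses scalar stand-ins for feature tensors, a two-step history, and hypothetical function names. It shows that first-order finite-difference extrapolation equals a linear predictor with weights \((-1, 2)\), and that least squares can instead fit those weights to an observed trajectory.

```python
def taylor_first_order(f_prev2, f_prev1):
    """First-order extrapolation: f_t ~ f_{t-1} + (f_{t-1} - f_{t-2})."""
    return f_prev1 + (f_prev1 - f_prev2)


def linear_predict(f_prev2, f_prev1, weights):
    """Generic linear predictor over the same two cached features."""
    w_prev2, w_prev1 = weights
    return w_prev2 * f_prev2 + w_prev1 * f_prev1


def fit_weights(trajectory):
    """Least-squares fit of (w_prev2, w_prev1) from consecutive triples.

    Solves the 2x2 normal equations in closed form.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for x2, x1, y in zip(trajectory, trajectory[1:], trajectory[2:]):
        a11 += x2 * x2
        a12 += x2 * x1
        a22 += x1 * x1
        b1 += x2 * y
        b2 += x1 * y
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# The hand-derived rule is one particular choice of weights.
same = taylor_first_order(1.0, 3.0) == linear_predict(1.0, 3.0, (-1.0, 2.0))

# Toy trajectory (assumed for illustration) that obeys
# f_t = 1.3 * f_{t-1} - 0.4 * f_{t-2}; the fit recovers those weights.
trajectory = [0.5 ** k + 0.8 ** k for k in range(6)]
learned = fit_weights(trajectory)
```

On this toy trajectory the fixed weights \((-1, 2)\) are not the best choice, which is the gap a data-driven predictor is meant to close.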
- Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
-
Addressing the issue where synthetic data generated by text-to-image (T2I) models for classifier training suffers from overfitting in fine-grained, few-shot scenarios, BOB explicitly extracts class-agnostic context (background, pose) from real images. These attributes are conditioned into prompts during fine-tuning to preserve diversity priors, and randomly paired across classes during generation to marginalize spurious associations. This improves CLIP classification accuracy on the Aircraft dataset from DataDream's 50.0% to 57.4%.
- Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
-
GAR-Font employs a combination of a "global-aware tokenizer + autoregressive generator + lightweight language adapter + GRPO post-refinement" to upgrade few-shot Chinese font generation from image-only patch-level modeling to a multimodal autoregressive framework that balances local strokes with global style. It can supplement style intent with a single text description, matching the generation quality of 8 reference images with only 4 images and 1 sentence.
- Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control
-
UniPath proposes a semantic-driven pathology image generation framework that achieves diagnostic-level controllable generation through multi-stream control (original text + diagnostic semantic tokens distilled from frozen pathology MLLMs + morphological control from a prototype library), reaching a Patho-FID of 80.9, which outperforms the runner-up by 51%.
- Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration
-
TICoE collaboratively erases target concepts from text-to-image diffusion models using a "Continuous Convex Concept Manifold (text-side) + Multi-scale Hierarchical Visual Representation (image-side)." This approach blocks the "resurrection via rephrasing" loophole in text-based erasure while preventing image-guided over-erasure of unrelated concepts with similar shapes or contexts. On tasks like gun, nudity, and Van Gogh, it achieves simultaneously stronger erasure (UDA 0.02) and superior fidelity (FID 30.86).
- Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training
-
This paper identifies the "Motion-Vision Quality Dilemma," where motion quality (MQ) and visual quality (VQ) in video data are negatively correlated. Through gradient analysis, it reveals that imbalanced data can produce equivalent learning signals at appropriate timesteps. It proposes the TQD framework, enabling models trained solely on imbalanced data to outperform those trained on "golden data."
- Bidirectional Normalizing Flow: From Data to Noise and Back
-
BiFlow removes the hard constraint in standard Normalizing Flow where the "backward process must be the exact analytical inverse of the forward process." Instead, it trains a separate backward model to approximate the inverse mapping (supervised by hidden state alignment). This allows the backward model to utilize a bidirectional attention Transformer, enabling image generation in a single forward pass (1-NFE). On ImageNet 256×256, a small 133M model achieves an FID of 2.39, which is both superior to and two orders of magnitude faster than its autoregressive counterpart, TARFlow.
- BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
-
BiFM enables the same flow matching model to simultaneously learn "noise-to-image" generation and "image-to-noise" inversion within a single training session. By constraining the average velocities of both directions using a shared instantaneous velocity field, it achieves high-fidelity inversion-based image editing under a 1–4 step budget, consistently outperforming existing few-step editing methods.
- BiGain: Unified Token Compression for Joint Generation and Classification
-
BiGain proposes a frequency-aware token compression framework. Using two training-free operators—Laplacian-gated token merging and interpolation-extrapolation KV downsampling—it simultaneously maintains generation quality and significantly improves discriminative classification performance for the first time in diffusion model acceleration.
- BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
-
BiMotion is proposed to compress variable-length motion sequences into a fixed number of control points using continuously differentiable B-spline curves. Combined with a specialized VAE and flow-matching diffusion model, it achieves fast, highly expressive, and semantically complete text-guided dynamic 3D character generation, outperforming existing methods in both quality and efficiency.
- BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
-
The BioVITA framework is proposed, comprising a million-scale tri-modal (image-text-audio) biological dataset, a two-stage alignment model, and a six-direction cross-modal species-level retrieval benchmark, achieving the first unified visual-textual-acoustic representation learning in the biological domain.
- Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models
-
Focusing on closed-source text-to-image diffusion models, this paper proposes SD-MIA: instead of traditional methods that add noise to images and check denoising capabilities, it perturbs text instructions and monitors the stability of reconstructed images to determine whether an image was in the pre-training data. Under pure black-box constraints (text-in, image-out only), it achieves an AUC up to 10 points higher than gray-box baselines that access internal features.
- BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation
-
The BlackMirror framework is proposed, which employs a two-stage process consisting of fine-grained instruction-response semantic deviation detection (MirrorMatch) and cross-prompt stability verification (MirrorVerify). It achieves universal detection of multiple backdoor attacks on T2I models under black-box conditions, reaching an average F1 score of 89.46%, significantly outperforming the existing black-box method UFID.
- Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation
-
This paper reformulates "generating new concepts" as "image synthesis conditioned on class distributions." Using a lightweight encoder–decoder (DisTok), any class distribution or random latent vector is decoded into "creative tokens" that can be embedded into prompts. This approach unifies controllable conditional exploration and open-ended unconditional exploration, achieving SOTA in text-to-image alignment and human preference for creative generation while being 13–40 times faster than existing methods.
- Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution
-
CODSR performs real-world image super-resolution via one-step diffusion: it first utilizes "local noise injection" based on gradient maps to activate generative priors in textured regions, then employs uncompressed LQ features to modulate U-Net intermediate layers for fidelity restoration, and finally constrains cross-attention with noun masks from Grounded-SAM2 for textual alignment. It achieves superior perceptual quality and competitive fidelity across four real-world datasets.
- C\(^2\)FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
-
This paper utilizes a rigorous upper bound on score discrepancy to prove that "conditional and unconditional distributions converge at an exponential rate during forward diffusion." Based on this, the fixed guidance weight \(\omega\) in CFG is replaced with an exponentially decaying time-varying control function \(\omega(t)\). This training-free, plug-and-play method further improves FID/IS to SOTA levels across various frameworks including DiT, SiT, Stable Diffusion, and EDM2.
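As a rough illustration of swapping a constant guidance weight for a time-varying one, the sketch below uses a hypothetical schedule. The functional form, the parameter names (`w_max`, `w_min`, `rate`), and the convention that \(t = 0\) is the data end and \(t = 1\) the noise end are all assumptions made here; the paper's actual \(\omega(t)\) may differ.

```python
import math


def guidance_weight(t, w_max=4.0, w_min=1.0, rate=3.0):
    """Hypothetical exponentially decaying guidance weight.

    Assumes t in [0, 1] with t = 0 at the data end and t = 1 at the noise end,
    so the weight shrinks where conditional and unconditional scores are
    expected to be close.
    """
    return w_min + (w_max - w_min) * math.exp(-rate * t)


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance mix: eps_u + w * (eps_c - eps_u)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Usage: the only change to a sampler is computing w per step.
guided = [
    cfg_combine([0.0, 0.1], [1.0, 0.3], guidance_weight(t))
    for t in (1.0, 0.5, 0.0)
]
```

Since only the scalar weight changes per step, a schedule of this kind can be dropped into an existing sampler without retraining.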
- CARD: Correlation Aware Restoration with Diffusion
-
CARD generalizes the DDRM diffusion inverse problem solver from the "i.i.d. Gaussian noise" assumption to the "spatially correlated noise" found in real sensors. By applying the inverse square root of the covariance matrix \(\Sigma^{-1/2}\) to whiten observations into i.i.d. noise, it performs DDRM closed-form updates in the whitened measurement space. The method is entirely training-free and consistently outperforms existing methods in denoising, deblurring, and super-resolution on both synthetic correlated noise and the newly collected real rolling-shutter dataset CIN-D.
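The whitening step can be illustrated on a toy case. This sketch is not CARD itself: it assumes a \(2 \times 2\) symmetric positive-definite covariance and uses the closed-form square root available for that size, with hypothetical helper names. It only shows that multiplying by \(\Sigma^{-1/2}\) turns correlated noise with covariance \(\Sigma\) into noise with identity covariance.

```python
import math


def inv_sqrt_2x2(cov):
    """Inverse principal square root of a 2x2 symmetric positive-definite matrix."""
    (a, b), (_, d) = cov
    s = math.sqrt(a * d - b * b)      # square root of the determinant
    t = math.sqrt(a + d + 2.0 * s)    # normaliser from the Cayley-Hamilton identity
    # Principal square root R = (cov + s*I) / t, so that R @ R == cov.
    r = [[(a + s) / t, b / t], [b / t, (d + s) / t]]
    det_r = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    # Invert the 2x2 matrix R explicitly.
    return [
        [r[1][1] / det_r, -r[0][1] / det_r],
        [-r[1][0] / det_r, r[0][0] / det_r],
    ]


def matmul(x, y):
    """Plain matrix product for small nested-list matrices."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def whiten(w, observation):
    """Apply the whitening matrix to a 2-vector observation."""
    return [sum(w[i][k] * observation[k] for k in range(2)) for i in range(2)]


sigma = [[2.0, 0.6], [0.6, 1.0]]      # toy correlated-noise covariance (assumed)
w = inv_sqrt_2x2(sigma)
# Covariance of the whitened noise is W @ Sigma @ W^T (W is symmetric here).
whitened_cov = matmul(matmul(w, sigma), w)
y_white = whiten(w, [0.3, -0.2])
```

For real images the covariance is far larger and would need a structured or factorized representation; the toy case only demonstrates the identity \(\Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = I\).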
- CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
-
This paper proposes CARE-Edit, a condition-aware expert routing framework. By utilizing heterogeneous experts (Text/Mask/Reference/Base) coupled with a lightweight latent-attention router on a DiT backbone, it achieves dynamic computation allocation. This effectively resolves issues like color bleeding and identity drift caused by conflicting multimodal signals (text, mask, reference images) in unified image editors.
- CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion
-
The paper proposes CaReFlow, the first method to use rectified flow for multimodal distribution mapping to bridge the modality gap. Through one-to-many mapping, source modality data points observe the global distribution of the target modality; adaptive relaxed alignment applies varying alignment strengths to modality pairs with different correlations; and cyclic rectified flow ensures no information is lost after mapping. It achieves SOTA on multiple multimodal affective computing benchmarks even with simple concatenation fusion.
- CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
-
To address the limitation in Interactive Text-to-Image Retrieval (I-TIR) where "all dialog turns share a static feature space," CAST introduces a lightweight module, CASR. This module dynamically "deforms" the latent space containing text and image features based on the context of each dialog turn. The Contextual Low-rank Projector (CLP) determines the semantic direction of the deformation, while the Context-Guided Modulator (CGM) determines the magnitude. On VisDial, CAST improves the 10-turn average R@1 from 48.44% (ChatIR) to 51.85%, with increasing advantages in later turns and negligible parameter overhead.
- CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
-
CaTok trains a diffusion autoencoder by binding "selecting 1D tokens within the time interval \([r,t]\)" with the "MeanFlow average velocity field objective." This ensures that the compressed 1D visual tokens possess both causality and balance, supporting both fast one-step generation and high-fidelity multi-step reconstruction. It achieves 0.75 rFID / 22.53 PSNR / 0.674 SSIM on ImageNet reconstruction with fewer training epochs.
- CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
-
This paper reinterprets Classifier-Free Guidance (CFG) as a feedback control process within flow-matching diffusion models, proposes a unified framework, CFG-Ctrl, and designs a nonlinear feedback guidance mechanism, SMC-CFG, based on Sliding Mode Control (SMC). This approach significantly enhances semantic consistency and generation robustness at large guidance scales.
- CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation
-
CG-Floor employs a "locate-then-draw" hierarchical framework for large-scale floorplan generation: first, a Graph Transformer predicts centroids and sizes of all rooms simultaneously, encoded as a "Size-Aware Semantic Centroid Heatmap" (SASCH) to anchor the global topology; then, a VQ-VAE codebook and a Vector Quantized Diffusion Transformer draw non-Manhattan (non-rectangular) room shapes guided by the SASCH. It reduces the FID from 79.7 to 16.0 on the large-scale MSD dataset.
- ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
-
ChArtist abstracts the data structures of "bar/line/pie" charts into minimalist skeletons as spatial conditions and overlays the subject conditions from reference images. It trains two independent LoRAs to learn these controls separately and uses spatial-gated attention during inference to ensure the subject adheres to the spatial structure, automatically generating pictorial charts that are both faithful to data and visually expressive.
- ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
-
To address data scarcity in few-shot and long-tail scenarios, ChimeraLoRA decomposes the LoRA of diffusion models into a shared matrix A (encoding class priors) and multiple per-image B heads (encoding instance details). By mixing multiple B heads with Dirichlet weights and applying Grounded-SAM box constraints during cropping to preserve target objects, the method synthesizes training sets that are both diverse and detail-rich. Downstream classification accuracy improves by an average of 2.1 percentage points across 9 datasets compared to the state-of-the-art.
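The head-mixing idea can be sketched as follows. This is an illustration under assumptions, not the ChimeraLoRA code: matrices are tiny nested lists, the function names are hypothetical, and the Dirichlet concentration values are arbitrary. It samples mixing weights from a Dirichlet distribution, blends the per-image \(B\) heads, and forms the low-rank update \(\Delta W = B_{\text{mix}} A\) with the shared \(A\).

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw Dirichlet weights by normalising independent Gamma(alpha_i, 1) samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def mix_heads(heads, weights):
    """Weighted sum of per-image B matrices (all with the same shape)."""
    rows, cols = len(heads[0]), len(heads[0][0])
    return [
        [sum(w * h[r][c] for w, h in zip(weights, heads)) for c in range(cols)]
        for r in range(rows)
    ]


def lora_delta(b, a):
    """Low-rank weight update: delta_W = B @ A."""
    return [
        [sum(b[i][k] * a[k][j] for k in range(len(a))) for j in range(len(a[0]))]
        for i in range(len(b))
    ]


rng = random.Random(0)
shared_a = [[3.0, 4.0]]                          # rank-1 shared A (toy values)
b_heads = [[[1.0], [2.0]], [[3.0], [4.0]]]       # two per-image B heads (toy values)
weights = sample_dirichlet([1.0, 1.0], rng)      # concentration values are an assumption
delta_w = lora_delta(mix_heads(b_heads, weights), shared_a)
```

Each fresh draw of the weights gives a different blend of instance details on top of the same shared prior, which is one way to read how the method trades off diversity and detail.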
- ChordEdit: One-Step Low-Energy Transport for Image Editing
-
Based on dynamic optimal transport theory, a low-energy Chord control field is derived to smooth the unstable naive edit field. This achieves training-free, inversion-free, high-fidelity real-time image editing for distilled one-step T2I models for the first time.
- Cinematic Audio Source Separation Using Visual Cues
-
This paper proposes the first audio-visual cinematic audio source separation (AV-CASS) framework, utilizing visual cues from dual video streams (facial and scene) via conditional flow matching for generative three-way audio separation (Speech/Effects/Music). The model is trained solely on synthetic data but generalizes to real-world movies.
- Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
-
Internal circuit mechanisms for spatial relation generation in Diffusion Transformers (DiT) are revealed through mechanistic interpretability: random embedding models employ a two-stage modular circuit (relation heads + object generation heads), whereas T5 encoder models fuse relation information into object tokens for single-token decoding, with significant differences in robustness between the two mechanisms.
- Closed-Form Concept Erasure via Double Projections
-
This paper proposes Double Projections (DP), which reformulates "concept erasure" for diffusion/flow-matching models into a two-step closed-form projection. It first projects target concepts into a "safe subspace" to obtain proxy vectors, and then constrains weight updates within the left null-space of preserved concepts. This achieves clean erasure of target concepts with near-zero damage to unrelated concepts in seconds without training.
- CoD: A Diffusion Foundation Model for Image Compression
-
This paper proposes CoD, the first compression-oriented diffusion foundation model. By learning joint end-to-end compression-generation optimization from scratch, it replaces Stable Diffusion in downstream diffusion codecs to achieve SOTA performance at ultra-low bitrates (0.0039 bpp), with training costs only 0.3% of SD.
- CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing
-
CogniEdit utilizes an MLLM to decompose complex instructions into executable editing commands and employs dynamic token focus to let different network layers attend to attributes of varying granularities. It transforms GRPO from "single-step independent optimization" into "trajectory-level dense optimization" by accumulating gradients across consecutive denoising steps. The approach achieves SOTA performance on Kris-Bench and GEdit-Bench for fine-grained instructions involving color, quantity, and position without sacrificing general editing capabilities.
- CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation
-
This paper proposes CoLoGen, a unified image generation framework based on "Concept-Localization Duality." By employing progressive multi-stage training and a Progressive Representation Weaving (PRW) dynamic expert routing architecture, it simultaneously reaches or exceeds the performance of specialized models across instruction editing, controllable generation, and personalized generation.
- CompBench: Benchmarking Complex Instruction-guided Image Editing
-
CompBench is the first benchmark for instruction-guided image editing oriented toward complex real-world scenarios. By extracting high-density occlusion scenes from the Video Object Segmentation (VOS) dataset MOSE and employing an MLLM-Human collaboration framework with an instruction decomposition strategy, the authors constructed 3K+ high-fidelity editing samples across 9 tasks in 5 categories. This work systematically reveals fundamental shortcomings of current editing models in multi-object handling, spatial reasoning, and implicit reasoning.
- Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
-
To address the difficulty of text-to-image models in handling compositional prompts such as "multiple objects + attribute binding + spatial relationships," BIDPO extends Diffusion DPO into a bimodal (image + text) preference optimization. It incorporates a region-level loss weighting based on bounding boxes and an automated pipeline generating 94,000 preference pairs. On T2I-CompBench, it improves attribute binding by approximately 17% and overall performance by 10%.
- ConsistCompose: Unified Multimodal Layout Control for Image Composition
-
The paper proposes ConsistCompose, achieving layout-controllable multi-instance image generation within a unified multimodal framework by directly embedding layout coordinates into language prompts (the LELG paradigm). It constructs the ConsistCompose3M dataset with 3.4 million samples to provide layout and identity supervision. Coupled with a coordinate-aware CFG mechanism, it achieves a 7.2% mIoU gain and a 13.7% AP gain on COCO-Position while maintaining general multimodal understanding capabilities.
- Correspondence-Attention Alignment for Multi-View Diffusion Models
-
The authors reveal that 3D self-attention in multi-view diffusion models spontaneously learns "cross-view geometric correspondence" in deep layers, but this signal degrades under large viewpoint changes. Based on this, they propose CAMEO—a method that directly supervises a single deep attention layer with geometric correspondence maps. This approach doubles convergence speed, improves novel view synthesis quality, and is universal across different multi-view diffusion models.
- CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
-
CRAFT-LoRA employs a triad of "rank-constrained backbone fine-tuning (to create a decoupling-friendly initialization) + expert encoder branch routing (assigning content/style LoRAs to disjoint layers via prompt tokens) + temporal asymmetric CFG (stabilizing training-free fusion during inference)." This allows independently trained content and style LoRAs to combine cleanly, achieving SOTA in content similarity, style similarity, and overall GPT-4o scores.
- CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think
-
CRAFT proposes an ultra-lightweight alignment method for diffusion models. It automatically constructs high-quality training sets using a Combined Reward Filtering (CRF) strategy and executes an enhanced SFT. Theoretically, CRAFT is proven to optimize the lower bound of Group Reinforcement Learning (RL). Using only 100 samples, it surpasses SOTA methods that require thousands of preference pairs, while training 11-220x faster.
- CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions
-
CREval replaces the "black-box scoring" of MLLMs with a VQA paradigm—generating truth-grounded binary questions and awarding points only for correct answers. Accompanied by CREval-Bench (874 samples across 3 categories and 9 dimensions), it decomposes evaluation into three interpretable metrics: Instruction Following (IF), Visual Consistency (VC), and Visual Quality (VQ). It reveals that current models still struggle with "free-form creative editing," particularly in preserving identity-defining elements of the original image.
- Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
-
Addressing text-driven 3D human motion editing, this paper utilizes "joint-anchored" and "time-anchored" Transformers to model the joint and time axes separately, integrating them via cross-axis fusion blocks. It incorporates an auxiliary task of regressing the Soft-DTW distance of source/target rotation trajectories, enabling the model to learn not only "when" to edit but also "which joints" to edit, achieving SOTA on MotionFix.
- Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
-
This paper proposes C-MET (Cross-Modal Emotion Transfer), which models the mapping of emotion semantic vectors between speech and facial expression spaces. It achieves speech-driven generation of extended emotions (e.g., sarcasm, charm) in talking face videos for the first time, with emotion accuracy exceeding SOTA by 14%.
- CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models
-
CSF treats text-to-image (T2I) models as "semantic category generators." It samples repeatedly using a batch of compositional semantic prompts that are extremely rare in fine-tuning data (e.g., "a dangerous urban nocturnal animal") to extract the model's category distribution for ambiguous prompts as a fingerprint. Using Wasserstein distance and Bayesian attribution, it identifies the protected base family of a suspect model accessible only via API—passing the "dominance" criterion across 6 base families and 13 fine-tuned variants.
- CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
-
This paper proposes CTCal (Cross-Timestep Self-Calibration), which utilizes reliable text-image alignment (cross-attention maps) formed at small timesteps (low noise) to calibrate representation learning at large timesteps (high noise). This provides explicit cross-timestep self-supervision for text-to-image generation, outperforming existing methods on T2I-CompBench++ and GenEval.
- Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
-
To address the issue in GRPO training for text-to-image (T2I) where "uniform sampling causes half the prompts to yield no learning gain," CGPO utilizes the reward variance of an image group per prompt as an online signal for "partially mastered but unstable" learning. By adaptively increasing sampling for prompts in this learning "sweet spot" and applying proportional fairness for category calibration, CGPO improves performance and trains 2x faster on GenEval, T2I-CompBench++, and DPG Bench.
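The variance-as-signal idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `variance_sampling_weights`, the `floor` smoothing term, and the simple normalization are assumptions made for illustration, and the proportional-fairness category calibration is left out.

```python
from statistics import pvariance


def variance_sampling_weights(rewards_per_prompt, floor=1e-3):
    """Turn each prompt's group reward variance into a sampling probability.

    `floor` (an assumed smoothing term) keeps zero-variance prompts from
    being dropped entirely.
    """
    scores = {p: pvariance(r) + floor for p, r in rewards_per_prompt.items()}
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}


# Hypothetical reward groups: one image group per prompt.
rewards = {
    "mastered": [1.0, 1.0, 1.0, 1.0],  # always succeeds: no learning signal
    "unstable": [1.0, 0.0, 1.0, 0.0],  # partially mastered: high variance
    "too_hard": [0.0, 0.0, 0.0, 0.0],  # always fails: no learning signal
}
weights = variance_sampling_weights(rewards)
```

In this toy setup the "unstable" prompt receives almost all of the sampling mass, while prompts whose groups all succeed or all fail are sampled only at the floor rate.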
- Cycle-Consistent Tuning for Layered Image Decomposition
-
This paper proposes a cycle-consistent fine-tuning framework based on diffusion models. By jointly training a decomposition model and a synthesis model to achieve image layer separation (e.g., logo-object decomposition) and introducing a progressive self-improving data augmentation strategy, it achieves robust decomposition in non-linear layer interaction scenarios.
- D2C: Accelerating Diffusion Model Training under Minimal Budgets via Condensation
-
This work is the first to apply Dataset Condensation (DC) to diffusion model training, proposing the two-stage D2C framework. The Select stage uses a diffusion difficulty score and interval sampling to select a compact subset, while the Attach stage appends textual and visual representations to each sample. Using only 0.8% of ImageNet (10K images), it achieves an FID of 4.3 in 40K steps, which is 100× faster than REPA and 233× faster than vanilla SiT.
- DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment
-
The paper proposes Detail-Aligned VAE (DA-VAE), which introduces structured "detail channels" into the latent space of a pre-trained VAE with alignment constraints. This approach compresses token counts by 4x without retraining the diffusion model. It enables 1024 \(\rightarrow\) 2048 generation for SD3.5 with only 5 H100-days of fine-tuning, achieving a 6x speedup.
- DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
-
Addressing the slow sampling issue of Diffusion Bridge Models (DBM) in image-to-image translation (often requiring dozens or hundreds of network evaluations), DBMSolver requires no network modification or training. By revealing the "semi-linear structure" of Bridge SDEs/ODEs and deriving closed-form solutions using Exponential Integrators (EI), it surpasses previous SOTA results with only 6 steps (NFE). On DIODE, it reduces FID by 53% compared to second-order baselines at 20 NFE.
- Elucidating the SNR-t Bias of Diffusion Probabilistic Models
-
This paper reveals a widespread SNR-t bias in diffusion models (where the signal-to-noise ratio of samples in the reverse process does not match the timestep) and proposes Dynamic Difference Correction in the Wavelet Domain (DCW). This training-free, plug-and-play method improves the generation quality of various diffusion models.
- DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
-
DDiT discovers that Diffusion Transformers need only coarse-grained patches during early denoising and require fine-grained ones only in the late stages. It adds lightweight LoRA branches to a frozen pre-trained DiT to support multiple patch sizes and utilizes a training-free scheduler to automatically select the largest available patch at each step based on the "acceleration of latent evolution." It achieves up to 3.52× acceleration on FLUX-1.Dev with almost no drop in FID.
- DDT: Decoupled Diffusion Transformer
-
DDT splits the traditional "decoder-only" Diffusion Transformer into a dedicated condition encoder for semantic extraction and a dedicated velocity decoder for velocity field regression. This decouples the optimization conflict between "semantic encoding" and "high-frequency decoding." It achieves a 1.31 FID on ImageNet 256×256 in only 256 epochs (approximately 4× faster than REPA) and further accelerates inference by nearly 3× by leveraging dynamic programming to share highly similar self-conditions across adjacent steps.
- DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
-
DeCo proposes a frequency-decoupled pixel diffusion framework that utilizes a lightweight pixel decoder to process high-frequency details, allowing the DiT to focus on low-frequency semantic modeling. Combined with a frequency-aware flow matching loss, it achieves FID scores of 1.62 (256) and 2.22 (512) on ImageNet, narrowing the gap between pixel-space and latent-space diffusion.
- Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
-
DRDD identifies that injecting Gaussian noise, beyond performing "manifold lifting," also implicitly narrows the feature distribution gap between different domains (acting as a "domain harmonizer"). Consequently, it decouples the traditional coupled diffusion into two independent stages: "noise addition for domain harmonization" followed by "deterministic residual mapping." This ensures the core source \(\to\) target mapping is completed entirely within a fixed noise domain, achieving robustness and data efficiency in unified restoration tasks and scenarios with limited paired data.
- Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
-
Diffusion/flow matching models typically use a single timestep for all patches and distribute computation uniformly. This paper proposes Patch Forcing (PF), which assigns independent noise levels to each patch during training and learns a lightweight "patch difficulty head." This allows confident (easy) regions to denoise first, providing "future" context for uncertain (difficult) regions. Combined with two difficulty-aware samplers, it reduces the FID of SiT on ImageNet 256² from 17.2 to 9.8 (XL/2, fixed computation).
- Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
-
To address the issues in e-commerce advertising where "images and copy use separate models and rely on group CTR to reflect average preferences," this paper proposes Uni-AdGen, a unified autoregressive model. It integrates ad images and copy into a single next-token prediction workflow for joint generation. It further employs a "coarse-to-fine preference understanding module" to extract personalized interests from noisy multimodal historical clicks and introduces PAd1M, the first large-scale personalized ad dataset, along with a background-sensitive evaluation metric (PBS). The results outperform baselines in both general and personalized settings.
- Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
-
The authors propose learning a "per-prompt and per-noise customized sampling schedule" for frozen text-to-image samplers without modifying model weights. By using a one-shot Dirichlet policy to output the entire schedule in a single forward pass and employing James-Stein shrinkage as a REINFORCE reward baseline to reduce gradient variance, the method improves text-image alignment for SD/Flux at the same step counts. Specifically, it enables Flux to approach distilled Flux-Schnell performance in just 5 steps.
- Diff-SemiER: Transparency-Aware Adaptive Fusion Diffusion Model with Generative Prior for Semi-Transparent Eyeglasses Removal
-
Aiming at the challenge of "semi-transparent sunglasses" where residual information exists under the lens but is partially occluded, Diff-SemiER employs a Generative Prior Diffusion Branch (GPDM) to reconstruct a structurally sound glass-free face, followed by a Transparency-Aware Fusion Diffusion Branch (TAFDM). Combined with a soft mask, it adaptively fuses "generated content" and "sub-lens real details" across both channel and spatial dimensions. This approach preserves identity and details under varying occlusion levels, outperforming existing methods on both synthetic and real-world datasets.
- DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
-
DiffGraph organizes vast amounts of online diffusion expert models (checkpoints / LoRAs) into a "universal graph." It employs two LLM agents to parse user prompts and dynamically activate subgraphs, utilizing a Variational Graph Autoencoder (VGAE) to predict merging coefficients for each expert. This allows for training-free and test-time-optimization-free on-demand merging of arbitrary experts, leading in human preference metrics on DABench and DiffusionDB.
- Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
-
Addressing the pain points where off-the-shelf CLIP fails to capture fine-grained makeup and holistic injection loses regional controllability, FRAM fine-tunes a specialized "Makeup CLIP Encoder" using synthetic data. It then extracts region-separated makeup features using learnable facial region queries coupled with attention loss. This allows diffusion-based methods to combine makeup from different reference images (e.g., skin/eyes/lips) for the first time, while achieving a superior balance between identity preservation and makeup consistency.
- Diffusion Mental Averages
-
This paper proposes Diffusion Mental Averages (DMA), which extracts "mental average" prototype images of concepts from pre-trained diffusion models by aligning multiple denoising trajectories in the semantic space, achieving consistent and realistic visualization of conceptual averages for the first time.
- Diffusion Probe: Generated Image Result Prediction Using CNN Probes
-
The authors find that the cross-attention distribution in the early denoising steps of diffusion models is highly correlated with the final image quality. This paper proposes Diffusion Probe, a lightweight CNN that predicts generation quality from early attention maps. By filtering out low-quality generation paths at only 10% of the way through the denoising process, it accelerates prompt optimization, seed selection, and GRPO training.
- DiP: Taming Diffusion Models in Pixel Space
-
This paper proposes DiP, an efficient pixel-space diffusion framework. By utilizing a DiT backbone to model global structures with large patches and a lightweight Patch Detailer Head to recover local details, it achieves computational efficiency comparable to LDMs without requiring a VAE, reaching a 1.79 FID on ImageNet 256×256.
- Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation
-
The DisCo framework is proposed to resolve the "similarity-controllability" paradox in subject-driven image generation. It first decouples text and visual information by replacing entity words with pronouns to eliminate textual interference on the subject, and then re-couples them using GRPO with a dedicated reward model.
- DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
-
DiT360 does not focus on model architecture but instead uses "perspective + panoramic hybrid training" to address the scarcity of high-quality real-world panoramic data. It injects cross-domain knowledge via perspective guidance and panoramic refinement at the image level (pre-VAE) and enforces geometric consistency via circular padding, yaw loss, and cube loss at the token level (post-VAE). It achieves state-of-the-art performance on Matterport3D across 11 metrics (notably FID 42.88).
- DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression
-
This work adapts a pretrained text-to-image DiT (SANA) into an efficient one-step image compression decoder. Through three alignment mechanisms—variance-guided reconstruction flow (pixel-adaptive denoising intensity), self-distillation alignment (using encoder latents as distillation targets), and latent-conditional guidance (replacing text encoders)—it achieves SOTA perceptual quality (BD-rate DISTS -87.88%) in a 32× downsampled deep latent space. It is 30× faster in decoding and can reconstruct 2K images within 16GB of laptop VRAM.
- DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO
-
Addressing the "mode collapse" (uniform faces and compositions) that occurs after applying GRPO to diffusion models for RLHF, DiverseGRPO tackles the issue from both reward modeling and denoising dynamics. It groups samples of the same caption using spectral clustering to issue "exploration rewards" inversely proportional to cluster size, and replaces the late-stage uniform KL regularization with a Wasserstein constraint applied only to early denoising steps. This improves semantic diversity by 13%–18% while maintaining quality, establishing a new Pareto frontier for quality-diversity.
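The "exploration reward inversely proportional to cluster size" can be sketched as follows. This is a hedged illustration only: it assumes cluster labels have already been produced (for example by spectral clustering), and the function name `exploration_rewards` and the `scale` parameter are hypothetical.

```python
from collections import Counter


def exploration_rewards(cluster_labels, scale=1.0):
    """Give each sample a bonus of scale / (size of its cluster)."""
    sizes = Counter(cluster_labels)
    return [scale / sizes[label] for label in cluster_labels]


# Four samples generated for one caption: three look alike, one is distinct.
labels = [0, 0, 0, 1]
bonus = exploration_rewards(labels)
```

Samples in the crowded cluster share a small bonus, while the lone sample in its own cluster receives the full `scale`, which nudges the policy toward less common modes.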
- Diversity over Uniformity: Rethinking Representation in Generated Image Detection
-
The Anti-Feature-Collapse Learning (AFCL) framework is proposed to maintain the diversity and complementarity of discriminative representations. By filtering irrelevant features via an information bottleneck and suppressing excessive overlap between different forgery cues, the method achieves significant improvements in cross-model generated image detection.
- Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
-
Addressing high variance and reward hacking in diffusion RL fine-tuning caused by "uniformly backfilling the final reward to every denoising step," this paper proposes AdaScope. By sensing semantic structural evolution and reward gain trends during denoising, it adaptively performs RL only on middle timesteps where "structure is formed but rewards are still increasing," achieving a 66% performance gain over SOTA while cutting computational costs by 59%.
- DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
-
DPAR utilizes a lightweight entropy model to compute the "next-token prediction entropy" for each image token. Adjacent tokens in low-information regions (e.g., sky, walls) are dynamically merged into variable-length patches, while high-information regions maintain token-level granularity. This allows the decoder-only autoregressive Transformer to perform next-patch prediction on a "reduced number of patches," decreasing token counts by 1.81×/2.06× and training FLOPs by up to 40.4% on ImageNet 256/384, while simultaneously improving FID by up to 29.6%.
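A minimal sketch of entropy-driven patch merging, under stated assumptions: the entropies are taken as given (the entropy model is not shown), and the fixed `threshold` and the `max_len` cap are illustrative choices, not DPAR's actual merging rule.

```python
def merge_low_entropy_tokens(entropies, threshold, max_len=4):
    """Group consecutive low-entropy token indices into patches.

    Tokens with entropy >= threshold stay as single-token patches;
    runs of lower-entropy tokens are merged, up to max_len per patch.
    """
    patches, current = [], []
    for idx, h in enumerate(entropies):
        if h >= threshold:
            if current:  # close the pending low-entropy run
                patches.append(current)
                current = []
            patches.append([idx])
        else:
            current.append(idx)
            if len(current) == max_len:
                patches.append(current)
                current = []
    if current:
        patches.append(current)
    return patches


# Hypothetical per-token entropies for a 7-token row.
entropies = [0.1, 0.2, 0.1, 2.5, 0.3, 0.2, 3.0]
patches = merge_low_entropy_tokens(entropies, threshold=1.0)
```

Here seven tokens collapse into four patches, so the autoregressive model would take four prediction steps instead of seven on this row.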
- Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
-
This paper formulates diffusion model sampling acceleration as a global path planning problem. By constructing a Path-Aware Cost Tensor (PACT) and using dynamic programming to select the optimal sequence of key timesteps, it achieves 4.87× training-free acceleration with generation quality exceeding the full-step baseline.
- DreamOmni2: Multimodal Instruction-based Generation and Editing
-
DreamOmni2 upgrades "instruction-based editing" and "subject-driven generation" into multimodal instruction tasks with reference images, capable of referencing both specific objects and abstract attributes (e.g., texture, pose, hairstyle, style). It generates training pairs via a three-stage synthetic data pipeline and equips the unified editing model Flux Kontext with index encoding, positional encoding offsets, and joint VLM training. This allows the model to ingest multiple reference images and understand complex colloquial instructions, outperforming GPT-4o and Nano Banana in human evaluations on its self-constructed benchmark.
- DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
-
DreamStereo models the "monocular-to-stereo video" conversion as an occlusion inpainting problem. It utilizes gradient-aware backward warping to generate clean training data and a sparse strategy that limits diffusion calculations solely to tokens in occluded regions. This achieves 25 FPS real-time HD stereo inpainting (768×1280) on a single A100 GPU (NFE=1, PSNR 30.5 dB).
- DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease
-
DRiffusion formalizes "skipping intermediate timesteps" in diffusion sampling as a local operator. It first uses this operator to draft approximate states for the next \(k\) timesteps at once, feeds these drafts into the original denoising network in parallel to obtain noises, and then refines them along the original trajectory. Without modifying pretrained models or samplers, it achieves a 1.4×–3.7× wall-clock speedup using \(n\) GPUs, while maintaining near-original FID/CLIP scores.
- DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
-
This paper proposes DUO-VSR, a three-stage distillation framework. It compresses multi-step video super-resolution models into a one-step generator through progressive guided distillation initialization, dual-stream distillation (joint optimization of DMD and RFS-GAN), and preference-guided fine-tuning. This achieves approximately 50× acceleration while exceeding the visual quality of previous one-step VSR methods.
- Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
-
Multi-view videos are organized into a "camera \(\times\) time" grid. By leveraging the dual-stream self-attention of MM-DiT, adjacent viewpoint and temporal features are fused simultaneously within local subgrids. This consistency is propagated across the entire grid using token inheritance and flow-guided token replacement. The edited frames are then used to optimize a pre-trained 4DGS without further training.
- DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
-
DynaVid proposes utilizing synthetic optical flow rendered via computer graphics (rather than synthetic video) to train video diffusion models. Through a two-stage framework consisting of a Motion Generator and a Motion-guided Video Generator, it achieves realistic video synthesis of highly dynamic motions and precise camera control.
- DynFusion: Rethinking Condition Fusion for Adaptive Multi-Conditional Text-to-Image Generation
-
DynFusion inserts a lightweight gating module, CAM, into each MMDiT block of the DiT architecture. This allows the model to autonomously decide which visual conditions (depth, edge, subject, background, etc.) to activate based on the "current denoising step, task, and injection position." By replacing static "blind stacking of all conditions" with dynamic sparse fusion, it simultaneously achieves better FID, controllability, and reduced inference FLOPs (Subject-Insertion FID 5.14→4.53, FLOPs 16.21T→7.76T).
- Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models
-
The EDA framework is proposed to extend the design space of EDM from pure Gaussian noise to arbitrary noise patterns. Flexible noise diffusion is achieved through SDEs driven by multivariate Gaussian distributions and multiple independent Wiener processes, proving that increased noise complexity introduces no additional sampling overhead. With only 5 sampling steps, it achieves performance comparable to or better than 100-step Refusion and specialized methods on MRI bias field correction, CT metal artifact removal, and natural image shadow removal.
- EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
-
EditMGT is the first instruction-based image editing model built on Masked Generative Transformers (MGT). By leveraging the "token-by-token flipping" local decoding characteristic of MGT, it employs multi-layer attention aggregation to localize editing regions and uses region-preserving sampling to revert tokens in low-attention areas back to the original image. This mechanism inherently eliminates "editing leakage" common in diffusion models. With only 960M parameters, it achieves SOTA image similarity across four benchmarks and is 6× faster than comparable models.
- EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
-
The EffectErase framework is proposed, which jointly learns video object insertion as an inverse auxiliary task for removal. A large-scale VOR dataset containing 60K video pairs is constructed to achieve high-quality erasing of objects and their visual side effects, including occlusions, shadows, reflections, lighting changes, and deformations.
- Efficient and Training-Free Single-Image Diffusion Models
-
By treating "all patches within a single image" as a finite dataset, it is demonstrated that the denoising score on this dataset has an analytical closed-form solution (a weighted denoiser similar to non-local-means). This transforms single-image diffusion models into a completely training-free process—matching or exceeding the quality and diversity of SinDDM/SinFusion which require hours of training, while enabling the generation of megapixel images in seconds and gigapixel images in minutes.
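The closed-form denoiser for a finite dataset can be sketched in a few lines. This assumes the simple noising model `noisy = x0 + sigma * noise` with `x0` drawn uniformly from the patch set; under that assumption the posterior mean is a softmax-weighted average of the patches, and the score is `(denoised - noisy) / sigma**2`. The function names are illustrative, and the paper's actual patch handling and acceleration tricks are not reproduced.

```python
import math


def closed_form_denoise(noisy, patches, sigma):
    """Posterior mean E[x0 | noisy] for x0 uniform over `patches`."""
    logits = [
        -sum((a - b) ** 2 for a, b in zip(noisy, p)) / (2.0 * sigma ** 2)
        for p in patches
    ]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [
        sum(w * p[d] for w, p in zip(weights, patches)) / z
        for d in range(len(noisy))
    ]


def closed_form_score(noisy, patches, sigma):
    """Score of the noised patch distribution at `noisy`."""
    denoised = closed_form_denoise(noisy, patches, sigma)
    return [(d - x) / sigma ** 2 for d, x in zip(denoised, noisy)]


# Two toy 2-pixel "patches".
patches = [[0.0, 0.0], [1.0, 1.0]]
near_first = closed_form_denoise([0.1, -0.1], patches, sigma=0.1)
midpoint = closed_form_denoise([0.5, 0.5], patches, sigma=1.0)
```

At low noise the output snaps to the nearest patch, and at a point equidistant from both patches it returns their average, which is the non-local-means-like weighting the summary describes.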
- Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices
-
To address the challenge of high noise in 4K/8K videos captured by mobile phones under extreme low-light (\(<1\)lx) while meeting strict constraints of \(<33\)ms latency and \(<250\)mA power consumption, this Samsung paper presents an end-to-end engineering solution ranging from "Mixed Dataset Construction → Lightweight mRLFB Denoising Network → Distillation/Re-parameterization/Quantization Optimization." It develops a real-time denoiser that can be directly integrated into commercial ISP pipelines (raw-in/raw-out, preserving CFA), running at 4K@30fps on Snapdragon NPUs with PSNR comparable to heavy SOTA models but with latency and power consumption reduced by an order of magnitude.
- Efficient Weighted Sampling via Score-based Generative Models
-
To address the requirement of "sampling from weighted distributions such as \(w(x)p(x)\)," this paper proposes LAGS, which adds a first-order guidance approximation (requiring no second-order derivatives or Hessians) to the score of a pre-trained diffusion model and combines it with a single-parameter time scheduler, derived from error theory, that dynamically adjusts guidance strength. The approach is completely training-free and is 1.2–4.7× faster than SOTA resampling methods on SDXL while achieving a higher PickScore.
- EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
-
EgoFlow proposes a generative framework based on Flow Matching that utilizes a Mamba-Transformer-Perceiver hybrid architecture to fuse multimodal scene conditions. During inference, it employs gradient-guided sampling to impose differentiable physical constraints (collision avoidance, motion smoothness), generating physically plausible 6DoF object motion trajectories from egocentric videos, with a collision rate reduction of up to 79%.
- Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
-
This work is the first to extend the MeanFlow framework from class-label conditioning to text-conditional image generation. It discovers that the semantic discriminability and disentanglement of text representations are key bottlenecks under restricted inference steps. Based on the BLIP3o-NEXT text encoder, the authors achieve high-quality few-step and one-step T2I generation.
- EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
-
The EMMA benchmark is proposed to systematically evaluate concept erasure methods for T2I models across five dimensions (erasing ability, retaining ability, efficiency, quality, and bias) with 12 metrics. Covering 206 concept categories across 5 domains, it reveals for the first time the "shallow erasure" nature of existing methods under implicit prompts and the issue of bias amplification.
- EmoStyle: Emotion-Driven Image Stylization
-
EmoStyle proposes the new task of "Affective Image Stylization (AIS)"—rendering a content image into an artistic style that evokes a target emotion using only a single emotion word (e.g., "fear", "awe"). This is achieved via an emotion-content reasoner that fuses emotion and content into style queries, and a style quantizer that discretizes continuous features into "per-emotion" style codebooks, improving the Emo-A metric from ~24% to 33.36%.
- EMR-Diff: Edge-aware Multimodal Residual Diffusion Model for Hyperspectral Image Super-resolution
-
EMR-Diff reformulates the fusion task of "Low-Resolution Hyperspectral Image (LR-HSI) + High-Resolution Multispectral Image (HR-MSI)" into "High-Resolution Hyperspectral Image (HR-HSI)" as a diffusion process. By transferring multimodal residuals instead of pure Gaussian noise between the start and end of the Markov chain, the sampling steps are reduced from thousands to 5. Furthermore, edge information from the HR-MSI is used to modulate the noise, forcing the model to focus on reconstructing high-frequency details. Combined with a dual-branch BAF-UNet, it outperforms over 10 SOTA methods across metrics like PSNR and SAM on the ICVL, Harvard, and Chikusei datasets.
- Enhancing Spatial Understanding in Image Generation via Reward Modeling
-
This paper constructs the 80K adversarial preference dataset SpatialReward-Dataset and trains a specialized reward model, SpatialScore (whose accuracy exceeds GPT-5), to evaluate spatial relationship precision. This model serves as the reward signal for online RL using GRPO with a top-k filtering strategy, significantly enhancing the spatial generation capabilities of FLUX.1-dev.
- Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
-
ETC models each concept as a Student-t Mixture Model (tMM) on text embeddings, uses Affine Optimal Transport (AOT) to map target concepts to an "anonymous" distribution, and automatically samples anchors from distribution boundaries (eliminating manual selection). By employing a MoE-based erasure module, MoEraser, combined with "Noise Injection-Recovery" training, it erases 2000+ cross-domain concepts on SDv1.4 / SDv3.5-L in one go while resisting "module deletion" white-box attacks, achieving SOTA in both scale and precision.
- Evaluating Generative Models via One-Dimensional Code Distributions
-
The evaluation of generative models is shifted from "continuous recognition features" to "discrete visual tokens." By using a 1D tokenizer to quantize images into token sequences, the authors design a training-free distribution distance (CHD) and a self-supervised no-reference quality score (CMMS). Both achieve state-of-the-art correlation with human judgment across multiple preference benchmarks.
- Exploring Conditions for Diffusion Models in Robotic Control
-
This paper explores how to use the conditioning mechanism of pre-trained text-to-image diffusion models to generate task-adaptive visual representations for robotic control. It finds that text conditions are ineffective in control environments due to domain gaps. The proposed ORCA framework introduces learnable task prompts and per-frame visual prompts as conditioning mechanisms, achieving SOTA on 12 tasks across DMC, MetaWorld, and Adroit benchmarks.
- Exploring Spatial Intelligence from a Generative Perspective
-
This paper introduces the concept of "Generative Spatial Intelligence" (GSI)—the ability of unified multimodal models to adhere to and manipulate 3D spatial constraints during image generation. The authors construct the first quantitative benchmark, GSI-Bench (comprising the real-world set GSI-Real and the synthetic set GSI-Syn), evaluated via space-anchored image editing tasks. Furthermore, it demonstrates that fine-tuning BAGEL solely on synthetic editing data significantly improves generative spatial editing and, crucially, transfers back to enhance the model's spatial "understanding" capabilities.
- ExpPortrait: Expressive Portrait Generation via Personalized Representation
-
This paper proposes high-fidelity personalized head representations (static identity offsets + dynamic expression offsets) to address the limited expressiveness of parametric models like SMPL-X. Combined with an identity-adaptive expression transfer module and a DiT generator, it achieves SOTA performance in both self-driven and cross-identity portrait video reenactment tasks.
- FabricGen: Microstructure-Aware Woven Fabric Generation
-
FabricGen decouples woven fabric generation into two paths: "macro texture" and "micro weaving structure." Specifically, a diffusion model fine-tuned on microstructure-free data generates albedo maps without microstructure, while a Language Model (WeavingLLM) designs weaving drafts and yarn parameters directly from text. These drive an enhanced procedural geometry model to synthesize yarn-level microstructures. The final fused rendering produces realistic fabrics with far richer details than previous methods while adhering to physical weaving rules.
- Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
-
The Face2Scene two-stage framework is proposed: it first utilizes a reference-based face restoration model (Ref-FR) to obtain HQ-LQ face pairs, from which a degradation code is extracted as an "oracle." This code then conditions a single-step diffusion model to complete full-scene image restoration, including the body and background.
- FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration
-
Instead of passively scoring T2I models with fixed prompt sets, this work formalizes "error finding" as a structured tree search over an entity × attribute combinatorial space. By utilizing rule-based pruning and learned prioritization, the astronomical search space is reduced to a feasible scale. The framework automatically uncovers 247,000 previously unknown "minimal failure slices" on SD1.5 and provides large-scale evidence correlating these failures with data scarcity in training sets.
- FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
-
FaithFusion reformulates the pixel-level decision of "whether and how much to edit" as Expected Information Gain (EIG). The same EIG signal serves two roles: it guides the diffusion model by restricting generation to high-uncertainty regions, and it acts as pixel-wise loss weights for distilling generated content back into 3DGS. This maintains geometric fidelity and appearance controllability under large perspective shifts, such as lane changes. The method achieves SOTA results on Waymo for NTA-IoU, NTL-IoU, and FID (retaining an FID of 107.47 even during a 6-meter lane change).
- FARMER: Flow AutoRegressive Transformer over Pixels
-
FARMER integrates invertible Autoregressive Flows (AF) and Autoregressive Transformers (AR) into an end-to-end framework, performing generation and exact likelihood estimation directly on raw pixels. By using AF to transform images into latent sequences and AR to implicitly model the distribution of these sequences—supported by self-supervised dimensionality reduction, one-step distillation, and resampling-based CFG—it reduces the FID on ImageNet \(256 \times 256\) from 6.64 (JetFormer) to 3.60.
- FastHybrid: Accelerating Hybrid Autoregressive Image Generation with Lookahead and Guided Decoding
-
Addressing the bottleneck of slow diffusion denoising in "Autoregressive + Diffusion Head" hybrid image generation, FastHybrid utilizes a lookahead branch to pre-decode several future tokens in parallel and an autoregressive branch to verify/correct them via cosine similarity. By employing guided diffusion sampling, the denoising steps for verified tokens are compressed from 100 to 10, achieving up to 1.97× inference acceleration for MAR without training, with an FID degradation of only approximately 0.11.
- FEAT: Fashion Editing and Try-On from Any Design
-
FEAT integrates "obtaining design inspiration from any image (artistic paintings, natural photos, abstract images)" and "performing virtual try-ons for full outfits (including accessories)" into a single diffusion framework. It utilizes Disentangled Dual Injection (DDI) to separate content (shapes/contours) and style (color/texture) in image prompts, injecting them into different U-Net attention blocks to suppress content leakage. Furthermore, it employs a training-free Orthogonal-Guided Noise Fusion (OGNF) mechanism to remove original clothing via orthogonal projection and applies distinct noise strategies to three regions, surpassing existing methods in sketch fidelity, prompt consistency, and realism.
- Few-shot Acoustic Synthesis with Multimodal Flow Matching
-
This paper proposes FLAC, the first few-shot Room Impulse Response (RIR) generation framework based on flow matching. It synthesizes spatially consistent acoustic responses in unseen scenes from a single recording and introduces the AGREE joint embedding for geometric-acoustic consistency evaluation.
- Few-Step Diffusion Sampling Through Instance-Aware Discretizations
-
Addressing the sub-optimality of "sharing a single set of time-step discretizations for all samples" in diffusion/Flow Matching sampling, this paper proposes INDIS, which trains a lightweight network \(\phi(\mathbf{x}_T, \mathbf{c})\) to generate instance-specific discretizations for each initial noise and condition. With nearly zero inference overhead, it significantly reduces FID for 3- to 7-step sampling (e.g., on CIFAR10 at NFE=3, FID drops from 16.5 to 9.3).
- FG-Portrait: 3D Flow Guided Editable Portrait Animation
-
This paper proposes FG-Portrait, which introduces "3D optical flow" directly computed from FLAME parametric 3D head models as a learning-free geometric motion correspondence. Combined with depth-guided sampled 3D optical flow encoding as the motion condition for a diffusion-based ControlNet, it significantly improves motion transfer accuracy (reducing APD by 22%+) and supports inference-time editing of expressions and head poses.
- Fine-Grained GRPO for Precise Preference Alignment in Flow Models
-
G²RPO (Granular-GRPO) transforms the sparse reward paradigm in flow-based GRPO training—where SDE noise is injected at every step and terminal rewards are averaged across the trajectory—into a "Singular Stochastic Sampling" approach where randomness is injected only at one step while others follow deterministic ODEs. By calculating and fusing advantages across multiple denoising granularities for the same direction, it provides precise and comprehensive reward signals. On Flux.1-dev, it outperforms DanceGRPO and MixGRPO across various in-/out-domain metrics including HPS, ImageReward, and Unified Reward.
- FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
-
FINE is a pre-training method for diffusion models: it formulates weights of each layer as \(U_\star \Sigma^{(l)}_\star V_\star^\top\). The shared singular vectors \(U_\star, V_\star\) (termed learngene) carry size-agnostic knowledge, while layer-specific singular values \(\Sigma^{(l)}_\star\) adapt to each layer. For any target size, one can directly initialize by freezing the learngene and performing lightweight retraining of \(\Sigma\) (approx. 0.3K steps vs. 300K steps for full pre-training). It reduces FID by up to 4.89 on ImageNet for variable-depth DiT models.
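The factorization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `init_layers`, `U`, `V`, and `sigmas` are hypothetical, the shared factors are random orthonormal stand-ins for a trained learngene, and the retraining of \(\Sigma\) is not shown.

```python
import numpy as np

def init_layers(U, V, sigmas):
    """Build one weight matrix per layer as U @ diag(sigma_l) @ V.T."""
    return [U @ np.diag(s) @ V.T for s in sigmas]

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 4

# Shared factors with orthonormal columns (random stand-ins for the learngene).
U, _ = np.linalg.qr(rng.standard_normal((d_out, rank)))
V, _ = np.linalg.qr(rng.standard_normal((d_in, rank)))

# One vector of singular values per layer; the target depth is chosen freely,
# which is what makes the shared factors reusable across model sizes.
depth = 3
sigmas = [np.abs(rng.standard_normal(rank)) for _ in range(depth)]

weights = init_layers(U, V, sigmas)
```

Only the entries of `sigmas` differ between layers, so a model of another depth reuses `U` and `V` unchanged and needs just a new list of singular-value vectors.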
- FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
-
Addressing the issue in real-time video generation where "denoising is fast enough, but convolutional decoders have become the bottleneck," FlashDecoder uses a pure Transformer decoder to decode latents into pixels frame-by-frame. By looking only at the recent \(W_{\text{frm}}\) frames via a fixed-length rolling KV cache, it achieves constant latency and bounded memory regardless of video length. On 1080p, it matches convolutional decoder reconstruction quality (41.55 vs. 41.49 dB PSNR) while being 3.6×–4.7× faster in throughput and saving up to 11× memory.
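The bounded-memory behavior of a fixed-length rolling cache can be illustrated with a toy sketch. The class `RollingCache` and the string placeholders for per-frame key/value tensors are assumptions made for illustration; the actual decoder's cache layout is not described in the summary.

```python
from collections import deque

class RollingCache:
    """Keep only the most recent `window` per-frame entries."""

    def __init__(self, window):
        # A deque with maxlen drops the oldest entry automatically on append.
        self.frames = deque(maxlen=window)

    def append(self, kv):
        self.frames.append(kv)

    def context(self):
        # Entries the current frame may attend to, oldest first.
        return list(self.frames)

cache = RollingCache(window=3)
for frame_idx in range(10):
    cache.append(f"kv_{frame_idx}")  # placeholder for real key/value tensors
```

Because the cache never holds more than `window` entries, the per-frame attention cost and memory stay constant however long the stream runs.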
- Flow Map Distillation Without Data
-
Conventional methods for distilling pretrained flow/diffusion teachers into "one-step" flow maps require sampling from external datasets, which this paper identifies as causing Teacher-Data Mismatch (where the data distribution differs from the teacher's true generation distribution). This work proposes sampling exclusively from the prior noise and using a "predictor-corrector" dual objective to keep the student on the teacher's vector field. This approach achieves FID scores of 1.45 and 1.49 on ImageNet 256/512 with a 1-NFE student, outperforming all data-based distillation baselines.
- Flow Matching for Multimodal Distributions
-
When adopting a vision foundation model (DINOv2-B) as a tokenizer, the latent space naturally exhibits a multimodal "union of manifolds" structure. This paper uses a Gaussian Mixture Model (GMM) fitted to the target distribution as the source distribution and performs data pairing based on the "nearest mode" (mode coupling). This ensures that probability mass is transported locally, accelerating flow matching training convergence by 30×, reducing sampling steps to 1/5, and achieving FID=2.74 on unconditional ImageNet256 generation (80 epochs).
- FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing
-
FlowDC decomposes complex target prompts with multiple editing goals into a sequence of progressive sub-prompts. It calculates "editing directions" for each goal along parallel trajectories and orthogonalizes them into a basis. By projecting the original editing velocity onto this basis, it retains components within the subspace and decays components orthogonal to the editing directions, achieving multi-target semantic alignment and source image consistency in a single round.
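The projection step can be sketched as follows, assuming the editing directions are given as row vectors and that the orthogonal component is scaled by a single constant. The function name `decouple_decay` and the `decay` value are hypothetical; the paper's actual decay schedule is not given in the summary.

```python
import numpy as np

def decouple_decay(velocity, directions, decay=0.5):
    """Keep the part of `velocity` inside span(directions); shrink the rest."""
    # Orthonormalize the editing directions; columns of Q form the basis.
    Q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    inside = Q @ (Q.T @ velocity)   # component within the editing subspace
    outside = velocity - inside     # component orthogonal to all directions
    return inside + decay * outside

directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # two toy editing directions
velocity = np.array([2.0, 3.0, 4.0])
edited = decouple_decay(velocity, directions, decay=0.5)
```

In this toy case the two directions span the first two axes, so the first two components of `velocity` are kept and the third is halved.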
- FlowFixer: Towards Detail-Preserving Subject-Driven Generation
-
FlowFixer is a model-agnostic, prompt-free refiner. It does not re-generate the scene; instead, it takes images from any Subject-Driven Generation (SDG) model as input alongside the original subject image as a reference. Using a pure image-to-image dual-stream diffusion process, it restores high-frequency details such as lost logos, text, and textures. Training data is synthesized via "one-step denoising" to create self-supervised pseudo-pairs where only details are degraded while layouts remain intact. Coupled with ground-truth-free metrics (AKI / KGain) based on keypoint matching, it achieves new SOTA subject fidelity across three mainstream SDG backbones (average KGain 77.3%, dominating competitors in human preference).
- FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories
-
FlowSteer supplements the neglected ReFlow/PeRFlow few-step distillation path by guiding the student model along the teacher's authentic generation trajectories (instead of linear interpolation). By integrating Online Trajectory Alignment (OTA), trajectory-level adversarial distillation, and a rectified scheduler, it achieves 4-step generation quality on SD3 that surpasses mainstream methods like PCM, Hyper-SD, and Flash Diffusion.
- FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
-
FontCrafter reformulates artistic font generation as a visual in-context generation task. By concatenating reference element images with a blank canvas and feeding them into a pre-trained inpainting model (FLUX.1-Fill), it achieves high-fidelity element-driven font creation, significantly outperforming existing methods in texture and structural fidelity.
- Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Diffusion Transformers
-
This work makes a key observation regarding training-free feature caching for Diffusion Transformers (DiT): in the feature space, only the low-rank principal subspace evolves smoothly and predictably over time, while the high-frequency residual subspace is jittery and hard to forecast. Consequently, SVD is employed to decompose features into two parts: EMA extrapolation is applied to the principal subspace, while the residual is directly reused. This achieves a nearly lossless 5.55× speedup on FLUX and HunyuanVideo.
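One way to picture the split is the sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: the function `forecast_next`, the choice to take the basis from the latest features, and the EMA-of-differences update with `beta` are all hypothetical.

```python
import numpy as np

def forecast_next(prev, curr, rank, ema_delta=None, beta=0.5):
    """Extrapolate the top-`rank` subspace of `curr`; reuse its residual."""
    # Right singular vectors of the latest features define the principal basis.
    _, _, vt = np.linalg.svd(curr, full_matrices=False)
    proj = vt[:rank].T @ vt[:rank]        # (dim, dim) projector onto the basis
    principal = curr @ proj
    residual = curr - principal           # hard-to-forecast part, reused as-is
    delta = principal - prev @ proj       # latest change inside the subspace
    if ema_delta is None:
        ema_delta = delta
    else:
        ema_delta = beta * ema_delta + (1 - beta) * delta
    return principal + ema_delta + residual, ema_delta

# Toy features (2 tokens x 3 dims): column 0 grows steadily, column 1 is a
# small fixed residual that is orthogonal to it across tokens.
prev = np.array([[1.0, 0.2, 0.0], [2.0, -0.1, 0.0]])
curr = np.array([[2.0, 0.2, 0.0], [4.0, -0.1, 0.0]])
pred, ema = forecast_next(prev, curr, rank=1)
```

Here the principal direction continues its linear trend while the residual column is copied unchanged, which is the division of labor the summary describes.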
- FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
-
FRAMER proposes a frequency-aligned self-distillation training framework that uses final-layer feature maps as teachers to supervise intermediate layers. By applying IntraCL and InterCL contrastive losses for low-frequency (LF) and high-frequency (HF) components respectively, combined with Frequency-Adaptive Weighting (FAW) and Frequency-Alignment Masking (FAM), the method significantly enhances high-frequency detail restoration in diffusion models for real-world image super-resolution without altering architecture or inference workflows.
- FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing
-
FreqEdit identifies the root cause of "performance breakdown in multi-turn instruction-based editing" as the continuous loss of high-frequency information during iterations. It constructs a reference velocity field from context images in early denoising stages and injects its high-frequency wavelet components into the editing velocity field in a spatially adaptive manner. Coupled with path compensation and quality guidance, this training-free framework enables FLUX.1 Kontext and Qwen-Image to perform stable editing for 10+ turns without geometric distortion.
- Frequency-Aware Flow Matching for High-Quality Image Generation
-
FreqFlow introduces explicit frequency-aware conditions into the flow matching framework. By employing a dual-branch architecture to separately process low-frequency global structures and high-frequency details, it achieves SOTA performance with a 1.38 FID on ImageNet-256.
- From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
-
This paper observes the intrinsic link between image layer decomposition and image inpainting/outpainting tasks. It proposes the Outpaint-and-Remove method, which efficiently adapts a pre-trained inpainting DiT model (FLUX.1-Fill-dev) into a layer decomposition model via lightweight LoRA fine-tuning. By introducing a multimodal context fusion module to preserve details and using only 100,000 synthetic training samples, it achieves SOTA performance.
- From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
-
Addressing the issue that T2I-oriented Image-CoT wastes computational power when directly applied to image editing, this paper proposes ADE-CoT. It dynamically allocates sampling budgets based on editing difficulty, replaces generic MLLM scoring with specialized "edit region + instruction consistency" verifiers for early pruning, and employs a depth-first "stop when sufficient" mechanism to eliminate redundant sampling. ADE-CoT achieves better image quality while accelerating inference by over 2× compared to Best-of-N across three SOTA editing models.
- From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution
-
Fresco replaces the fragmented stage-by-stage re-noising in traditional dynamic resolution sampling with a "coordinate-bound unified noise field" + "token variance-adaptive progressive upsampling." This ensures that low-resolution sketches and high-resolution refinements converge toward the same target. It is training-free and accelerates FLUX by 10× and HunyuanVideo by 5×. It is orthogonal to distillation/feature caching, reaching up to 22× speedup when combined.
- Functional Mean Flow in Hilbert Space
-
This work extends the "one-step generation" of Mean Flow from finite-dimensional Euclidean space to infinite-dimensional Hilbert (functional) space. By reconstructing the training target for the average velocity field using Fréchet derivatives of two-parameter flows and introducing a more stable x1-prediction variant, it enables high-quality single-step sampling for various functional data types, including time series, images, PDEs, and 3D shapes.
- Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
-
DPOFusion adapts Direct Preference Optimization (DPO) from LLMs for infrared-visible image fusion. It first utilizes an attribute-aligned latent diffusion model to generate diverse fusion candidates, then applies "instance-level DPO" to fine-tune preferences only within regions of interest while enforcing consistency with a reference model elsewhere. This single framework simultaneously satisfies four types of heterogeneous preferences: human, VLM, detection, and segmentation.
- FVAR: Next-Focus Prediction for Visual Autoregressive Modeling
-
FVAR reformulates the "next-scale prediction" of Visual Autoregressive (VAR) modeling into "next-focus prediction." By constructing a pyramid from blur to clarity using physically consistent defocus kernels, it eliminates aliasing (jaggies/Moiré patterns) caused by uniform downsampling at the source. Additionally, a high-frequency residual teacher, present only during training, distills aliasing information into the original VAR student network, achieving zero extra inference overhead while outperforming VAR / M-VAR on ImageNet across all FID metrics.
- Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories
-
This paper proposes Garments2Look, the first large-scale multimodal outfit-level virtual try-on dataset (80K pairs, 40 categories, 300+ subcategories). Each group contains 3-12 reference garment images, a model outfit image, and detailed text annotations, revealing significant deficiencies of existing methods in multi-layer styling and accessory consistency.
- Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
-
When applied to linear-attention diffusion models (e.g., SANA), ControlNet fails to adapt to non-aligned conditions and OminiControl converges extremely slowly on spatially aligned tasks. To address both issues, this paper proposes GateControl. By utilizing a "shared backbone + unified intra-block interaction + a 0.09M parameter token-level gate," the method speeds up the convergence of spatial tasks by over 10× while adding only approximately 1.18% trainable parameters. It provides unified support for both spatially aligned (Canny/Depth/Colorization) and non-aligned (Subject-driven) conditions.
- GDRO: Group-level Reward Post-training Suitable for Diffusion Models
-
GDRO adapts the Group Relative Policy Optimization (GRPO) alignment strategy from LLMs to rectified flow diffusion models. By utilizing a DPO-style "implicit reward function" to calculate rewards directly at arbitrary noise timesteps, it achieves fully offline training (eliminating the need for iterative online sampling) and remains sampler-agnostic (avoiding ODE-to-SDE approximations). GDRO approaches or exceeds Flow-GRPO in OCR and GenEval text-to-image tasks with 2–3.7× higher efficiency while significantly mitigating reward hacking.
- GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation
-
GenColorBench is the first benchmark to systematically evaluate the "color accuracy" of text-to-image (T2I) models. It constructs 44,000 prompts across five color tasks using ISCC-NBS / CSS3-X11 color systems and RGB/hex values. By employing an evaluation pipeline based on "Color Science dominant colors + \(\Delta E\)" that does not rely on VLMs, the study reveals that current SOTA models are generally weak in precise color control (failing to exceed 50% accuracy in most tasks).
- GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers
-
GeoRelight integrates "portrait relighting" and "3D geometric reconstruction" into a single multi-modal Diffusion Transformer for joint denoising. By utilizing iNOD—a VAE-friendly, distortion-free depth representation that enables 3D geometry to enter the latent space—and a mixed training strategy combining synthetic and auto-labeled real data to bridge the sim-to-real gap, the model produces photorealistic relighting, intrinsic albedo, surface normals, and high-fidelity 3D shapes from a single image. It outperforms specialized SOTA methods across relighting, geometry, and intrinsic estimation tasks.
- GeoRK2: Geometry-Guided Runge-Kutta Integration for Diffusion Transformer Acceleration
-
GeoRK2 reformulates the few-step sampling of Diffusion Transformers as "second-order Runge-Kutta integration on a Riemannian manifold induced by feature covariance." It replaces the default numerical updates of existing samplers with a training-free, plug-and-play "Predictor-Corrector" module, achieving 4–5× acceleration on ImageNet/FLUX/HunyuanVideo with almost no quality loss (\(\Delta\)FID≈0.81).
- GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
-
This paper proposes GlyphPrinter, which significantly improves glyph accuracy in visual text rendering without relying on an explicit reward model. It constructs the region-level glyph preference dataset GlyphCorrector, trains with the Region-Grouped DPO (R-GDPO) objective, and introduces inference-time Regional Reward Guidance for controllable generation.
- Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
-
The proposed GenReward framework uses pre-trained video diffusion models to generate goal-conditioned videos. It guides reinforcement learning agents through two-tier goal-driven reward signals at the video and frame levels, significantly outperforming baselines on Meta-World robotic manipulation tasks without manual reward function design.
- gQIR: Generative Quanta Image Reconstruction
-
This work adapts large-scale text-to-image latent diffusion models to the extreme photon-starved imaging regime of Single-Photon Avalanche Diodes (SPADs). Through a three-stage framework (Quanta-aligned VAE → Adversarially fine-tuned LoRA U-Net → FusionViT spatio-temporal fusion), the method achieves high-quality RGB reconstruction from sparse binary photon detections, significantly surpassing all existing methods under extreme 10K-100K fps conditions.
- GrOCE: Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models
-
GrOCE introduces a training-free concept erasure framework based on dynamic semantic graphs. By integrating three synergistic components—semantic graph construction, adaptive cluster identification, and selective severance—it achieves precise, context-aware online removal of target concepts within text-to-image diffusion models.
- Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
-
Diffusion models have traditionally generated images independently during inference. This paper enables a group of semantically similar images to "reference" each other's patches via cross-sample attention during denoising. With only a token reshaping modification, it improves the FID of SiT-XL/2 on ImageNet-256 by 32.2%.
- Group Editing: Edit Multiple Images in One Go
-
This paper proposes GroupEditing, which recasts a set of related images as pseudo-video frames. By combining explicit geometric correspondences provided by VGGT with implicit temporal priors from video models through enhanced positional encodings (Ge-RoPE and Identity-RoPE), it achieves cross-view consistent group image editing, significantly outperforming existing methods in visual quality, editing consistency, and semantic alignment.
- GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
-
This paper discovers that FlowGRPO exhibits a systematic left-shift in importance ratio distributions and inconsistent variance across denoising steps when fine-tuning flow matching models. This causes PPO clipping to fail completely for "overconfident positive samples," leading the model into implicit reward hacking. GRPO-Guard introduces RatioNorm to standardize the ratio back to a mean of 1 and uses \(1/dt\) gradient reweighting to balance step-wise gradients. Without relying on heavy KL regularization, it significantly mitigates over-optimization and preserves generation quality.
- Guiding a Diffusion Model by Swapping Its Tokens
-
This paper introduces Self-Swap Guidance (SSG), a condition-independent sampling guidance method for diffusion models. By selectively swapping the most semantically dissimilar token pairs within the model's intermediate representation space to construct a perturbed version, SSG generates high-fidelity images stably across a wider range of guidance scales than methods like SAG/PAG/SEG, achieving state-of-the-art FID in both conditional and unconditional generation.
- Guiding a Diffusion Transformer with the Internal Dynamics of Itself
-
This paper proposes Internal Guidance (IG), which adds auxiliary supervision losses to the intermediate layers of a Diffusion Transformer to produce weaker generative outputs. During sampling, it extrapolates the difference between intermediate and deep layer outputs to achieve guidance effects similar to Autoguidance without extra sampling steps or external model training. On ImageNet 256×256, it pushes the FID of LightningDiT-XL/1 to 1.34 (w/o CFG) and 1.19 (+CFG), reaching the current SOTA.
- Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning
-
To address the issues of coarse conditioning and insufficient fidelity/diversity in synthetic data generated by pre-trained diffusion models under One-Shot Federated Learning (OSFL), this paper proposes Espresso. The method performs intra-class clustering on the client side to learn fine-grained conditional embeddings for each sub-pattern. It further utilizes GMMs to model the latent space initial noise distribution and introduces Z-Sampling, a self-reflective sampling strategy, to fully inject conditional semantics into the generation process. This approach achieves SOTA global model accuracy on three heterogeneous datasets: DomainNet, PACS, and NICO++.
- Guiding Diffusion Models with Semantically Degraded Conditions
-
Condition-Degradation Guidance (CDG) is proposed to replace the null prompt \(\emptyset\) in CFG with a semantically degraded condition \(\boldsymbol{c}_{\text{deg}}\). This shifts the guidance from a coarse "good vs. empty" contrast to a fine-grained "good vs. slightly worse" comparison. By employing a hierarchical degradation strategy (degrading content tokens followed by context-aggregating tokens) to construct adaptive negative samples, the method achieves plug-and-play improvements in compositional generation accuracy across models such as SD3, FLUX, and Qwen-Image with near-zero additional overhead.
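The change relative to CFG can be seen in a small sketch. It assumes CDG keeps the standard CFG extrapolation form and only swaps the reference branch; the function name `guided_prediction` and the toy vectors standing in for model predictions are hypothetical.

```python
import numpy as np

def guided_prediction(pred_cond, pred_reference, scale):
    """Extrapolate from a reference prediction toward the conditional one."""
    return pred_reference + scale * (pred_cond - pred_reference)

pred_cond = np.array([1.0, 2.0])  # prediction under the full prompt c
pred_null = np.array([0.0, 0.0])  # prediction under the null prompt (CFG)
pred_deg = np.array([0.8, 1.5])   # prediction under a degraded prompt c_deg

# CFG contrasts the prompt with "nothing"; CDG contrasts it with a slightly
# worse version of itself, so the guidance direction is finer-grained.
cfg_out = guided_prediction(pred_cond, pred_null, scale=3.0)
cdg_out = guided_prediction(pred_cond, pred_deg, scale=3.0)
```

With the same scale, the degraded reference yields a smaller and more targeted correction, since `pred_cond - pred_deg` only captures what the degradation removed.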
- Guiding Token-Sparse Diffusion Models
-
Addressing the issue that "diffusion models trained with token sparsity barely respond to CFG," this paper proposes Sparse Guidance (SG). During inference, two conditional predictions are computed using different token sparsity rates—one strong and one weak. The "capacity gap" between them replaces the unconditional branch in CFG to guide the generation. Without any dense fine-tuning, it achieves a 1.58 FID on ImageNet-256 while saving 25% FLOPs, proving equally effective on 2.5B text-to-image models.
- Harmonic Canvas: Inversion-Free Editing for Visually-Guided Music Style Transfer
-
This paper treats "image atmosphere" as a third conditioning modality for music style, proposing a multimodal music style transfer framework based on inversion-free flow editing. Visual and textual cues are injected into an audio DiT backbone via a CLIP+ViT dual encoder with cross-adapters. A differentiable normalized chroma constraint is used to "pull back" the pitch structure along the flow trajectory, effectively preserving the source melody while allowing large-scale style changes. On metrics such as FAD and IMSM, the method outperforms existing text- or audio-conditioned methods.
- Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
-
Harmony co-trains the joint generation task with two unidirectional auxiliary tasks using clean signals (audio-driven video and video-driven audio). By incorporating a decoupled interaction module that separates coarse style from fine-grained temporal alignment, alongside SyncCFG—which utilizes "silence/stillness" as negative anchors to amplify synchronization signals—Harmony achieves the first stable open-source breakthrough in precise lip-sync and motion-sound correspondence, outperforming Ovi and UniVerse-1.
- Heterogeneous Decentralized Diffusion Models
-
A heterogeneous decentralized diffusion framework is proposed, allowing different experts to be trained completely independently using distinct diffusion objectives (DDPM \(\epsilon\)-prediction and Flow Matching velocity-prediction). During inference, their outputs are unified in velocity space for fusion via a deterministic schedule-aware transformation. This approach improves both FID and generation diversity compared to homogeneous baselines while reducing the computational load by 16×.
- HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning
-
Addressing complex prompts with "multi-subject + hierarchical attributes," HiCoGen moves away from monolithic single-step generation. Instead, it utilizes an LLM to decompose prompts into minimal semantic units following a "Chain of Synthesis (CoS)," where each step generates a unit using previous images as visual context. Combined with Group Relative Policy Optimization (GRPO) featuring hierarchical rewards and a decaying stochasticity schedule, it significantly improves concept coverage (Acc\(_{exist}\) 0.71) and compositional accuracy over existing T2I/subject-driven models.
- HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
-
The HiFi-Inpaint framework is proposed, utilizing Shared Enhanced Attention (SEA) to leverage high-frequency information for enhancing product detail features, combined with Detail-Aware Loss (DAL) for pixel-level high-frequency supervision, achieving SOTA detail fidelity in human-product image generation.
- High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
-
This paper proposes an ID-constrained attribute tuning framework for diffusion-based face swapping: the approach first constrains the identity solution space, then injects attribute conditions, and finally performs end-to-end refinement using identity and adversarial losses. Combined with a decoupled condition injection design, it achieves SOTA FID (3.61) and identity retrieval accuracy (97.9% Top-1) on FFHQ.
- High-Fidelity Virtual Try-On beyond Paired Data Scarcity via Diffusion-based Cycle-Consistent Learning
-
CCVTON utilizes a unified diffusion Transformer to simultaneously learn "try-off" and "try-on" tasks. It organizes massive unlabeled real-world portraits into a "de-clothing and re-clothing" reconstruction cycle for training, thereby eliminating dependence on scarce paired data. Complemented by a two-stage garment-aware masking mechanism to suppress original garment leakage, it achieves SOTA performance on VITON-HD and DressCode.
- Hist2Style: Histogram-Guided Stylization with Bilateral Grids
-
Hist2Style distills a large image editing model into a lightweight network with only 1.5M parameters. By utilizing "bilateral grids + color histogram conditioning," it constrains style transfer to locally affine tone/color transformations. This approach preserves content structure and eliminates hallucinations while achieving real-time performance at high resolutions and enabling interactive color grading via direct histogram manipulation.
- HP-Edit: A Human-Preference Post-Training Framework for Image Editing
-
This paper proposes HP-Edit, a human-preference post-training framework for image editing. It fine-tunes a VLM-based automatic scorer, HP-Scorer, using a small amount of human-scored data to construct preference datasets and serve as a reward model. Through online Flow-GRPO post-training, pre-trained editing models (e.g., Qwen-Image-Edit-2509) are aligned with human preferences. The authors also release the RealPref-50K dataset and RealPref-Bench benchmark.
- Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
-
This paper proposes IPRO, which directly optimizes video diffusion models using reinforcement learning and a differentiable facial identity scorer. Without modifying the model architecture, it significantly improves facial identity consistency in image-to-video generation, achieving a 20%-45% increase in FaceSim on Wan 2.2.
- IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations
-
Proposes IDperturb, a geometric sampling strategy that performs angular perturbations on identity embeddings on the unit hypersphere. It significantly enhances the intra-class diversity of synthetic face datasets without modifying generative models, thereby improving downstream face recognition performance.
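The angular perturbation described above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name `angular_perturb`, the 512-dimensional embedding, and the 20-degree angle are all assumptions chosen for the example. The sketch rotates a unit-norm embedding along a great circle toward a random tangent direction, so the result stays on the unit hypersphere at exactly the requested angle.

```python
import numpy as np

def angular_perturb(embedding, angle_rad, rng):
    """Rotate a unit-norm embedding by `angle_rad` toward a random tangent direction."""
    e = embedding / np.linalg.norm(embedding)
    # Draw a random direction and remove its component along e (Gram-Schmidt step).
    noise = rng.standard_normal(e.shape)
    tangent = noise - np.dot(noise, e) * e
    tangent /= np.linalg.norm(tangent)
    # Move along the great circle spanned by e and the tangent direction.
    return np.cos(angle_rad) * e + np.sin(angle_rad) * tangent

rng = np.random.default_rng(0)
identity = rng.standard_normal(512)          # hypothetical identity embedding
identity /= np.linalg.norm(identity)
perturbed = angular_perturb(identity, np.deg2rad(20.0), rng)
cos_sim = float(np.dot(identity, perturbed))  # equals cos(20 degrees) up to rounding
```

Sampling several angles per identity would yield a set of nearby embeddings for the same person, which is the kind of intra-class variation the summary refers to.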
- Image Diffusion Preview with Consistency Solver
-
This paper proposes the Diffusion Preview paradigm and ConsistencySolver—a lightweight high-order ODE solver trained via reinforcement learning. It generates high-quality preview images during low-step sampling while ensuring consistency with full-step outputs. It achieves an FID comparable to Multistep DPM-Solver with 47% fewer steps and reduces user interaction time by nearly 50%.
- Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
-
This paper proposes DreamPRVR, which adopts an "imagine before concentration" coarse-to-fine strategy: global semantic register tokens are generated via a truncated diffusion model under text supervision and then fused into fine-grained video representations. This effectively suppresses local noisy responses and achieves SOTA on three PRVR benchmarks.
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
-
iMontage transforms a pretrained video diffusion model (HunyuanVideo) into a unified generator that accepts an arbitrary number of reference images and generates multiple high-dynamic output images based on instructions. By utilizing a Marginal RoPE (treating input/output images as "pseudo-frames" at opposite ends of a sequence) that requires minimal modification to the original network, the model preserves motion priors while breaking the dynamic limitations of continuous frames, achieving state-of-the-art open-source performance in image editing, multi-to-one generation, and storyboard generation.
- Improved Mean Flows: On the Challenges of Fastforward Generative Models
-
The paper diagnoses two root causes of failures in MeanFlow (a one-step generation framework): the training objective's dependence on the network itself and the hard-coded CFG guidance scale before training. By rewriting the objective as a network-independent v-loss using the predicted marginal velocity as the JVP input, and treating the guidance scale as a variable condition injected via multi-token in-context conditioning, the proposed iMF achieves a 1.72 FID on ImageNet 256×256 with a single function evaluation (1-NFE) trained from scratch. This represents an approximately 50% relative improvement over the original MeanFlow, approaching the performance of multi-step methods without any distillation.
- Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision
-
This paper identifies that inheriting the base model's \(\epsilon\)-supervision loss for controllable generation methods like ControlNet is suboptimal. Since \(\epsilon\)-loss is equivalent to \(x_0\)-loss weighted by Signal-to-Noise Ratio (SNR), it effectively assigns near-zero weight to early denoising steps that determine global layout. By switching to direct supervision of the clean image \(x_0\) (removing this weighting), convergence speed increases by up to 2× (measured by the proposed mAUCC metric) across ControlNet, T2I-Adapter, GLIGEN, and OminiControl, while simultaneously improving image quality and control fidelity.
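The claimed equivalence between the \(\epsilon\)-loss and an SNR-weighted \(x_0\)-loss can be checked numerically. The sketch below assumes the common forward process \(x_t = \alpha_t x_0 + \sigma_t \epsilon\) with \(\mathrm{SNR}(t) = \alpha_t^2 / \sigma_t^2\); the schedule values and the noisy stand-in for the network prediction are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)        # clean sample
eps = rng.standard_normal(16)       # injected noise
alpha_t, sigma_t = 0.3, 0.95        # hypothetical schedule values at a noisy step

x_t = alpha_t * x0 + sigma_t * eps  # forward-noised sample

# Stand-in for an imperfect network prediction of the clean sample.
x0_hat = x0 + 0.1 * rng.standard_normal(16)
# Noise estimate implied by that prediction under the same forward process.
eps_hat = (x_t - alpha_t * x0_hat) / sigma_t

eps_loss = np.sum((eps - eps_hat) ** 2)
x0_loss = np.sum((x0 - x0_hat) ** 2)
snr = alpha_t ** 2 / sigma_t ** 2   # small at high noise levels
```

Here `eps_loss` matches `snr * x0_loss`, so at high-noise steps, where the SNR is close to zero, the same \(x_0\) error contributes very little to the \(\epsilon\)-loss.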
- Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance
-
This paper unifies guidance methods in diffusion sampling under a "weak-to-strong (W2S)" perspective, categorizing them into "Condition-Dependent Guidance (CDG, e.g., CFG)" and "Condition-Agnostic Guidance (CAG, e.g., AG/SLG)". By characterizing their respective effective intervals through synthetic experiments, the authors propose SGG (Segmented Guidance), which switches between the two guidance types based on noise levels. This principle is further carried over into the training objective to enhance the inherent generalization capabilities of guidance-free models.
- IncreFA: Breaking the Static Wall of Generative Model Attribution
-
This paper redefines the static classification problem of "identifying which generative model produced an image" as incremental attribution. By encoding the "lineage" of generative models with hierarchical orthogonal priors and using a latent space memory bank for replay and synthesizing pseudo-unseen samples, the system evolves continuously without forgetting. It achieves SOTA attribution accuracy and a 98.93% unseen detection rate on the new IABench benchmark covering 28 generative models.
- InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
-
InnoAds-Composer is a single-stage e-commerce poster generation framework based on MM-DiT. It maps product subjects, glyph texts, and background styles into a unified space via tokenization. By combining a Text Feature Enhancement Module (TFEM) and an importance-aware condition injection strategy, it achieves high-quality generation while significantly reducing inference overhead.
- InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
-
This work distills the editing capabilities of a single-image instruction editor (InstructPix2Pix) into a pre-trained multi-view diffusion model (SEVA) via Score Distillation Sampling (SDS). The latter's data-driven 3D prior serves as an "integrator," enabling consistent cross-view image editing even with only a few sparse views.
- Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
-
This work reveals that a majority of neurons (~81%) in Sparse Autoencoders (SAEs) suffer from insufficient interpretability or steerability. It proposes the CB-SAE framework, which prunes low-utility SAE neurons and integrates a concept bottleneck module, improving interpretability by 32.1% and steerability by 14.5% in LVLM and image generation tasks, respectively.
- Intrinsic Concept Extraction Based on Compositional Interpretability
-
HyperExpress proposes the new task of Compositional Interpretability Intrinsic Concept Extraction (CI-ICE), leveraging the hierarchical modeling capabilities of hyperbolic space and a horosphere projection module to extract composable object-level and attribute-level concepts from a single image, achieving reversible decomposition of complex visual concepts.
- IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework
-
IntroSVG treats a unified VLM as both a "Generator" and a "Critic," enabling it to render its own SVG code during inference, evaluate scores based on visual feedback, and refine the output. Combined with a training pipeline involving "constructing training data from failed samples + DPO alignment," it achieves SOTA performance across multiple Text-to-SVG metrics (RSR 99.26%, FID 26.18, Aesthetic 4.89).
- It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
-
To address the mode collapse problem where text-to-image diffusion models produce nearly identical outputs for the same prompt across different samples, this paper proposes an end-to-end gradient optimization of initial noise to push a set of samples apart. Combined with a "Pink Noise" initialization that concentrates energy in low frequencies, the method significantly enhances generation diversity without modifying the model or prompt, and with minimal impact on image quality.
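A generic way to obtain noise whose energy is concentrated in low frequencies is to shape white Gaussian noise in the Fourier domain. The sketch below is a textbook \(1/f\) construction and only an assumption about what a "Pink Noise" initialization might look like; the function name `pink_noise`, the `exponent` parameter, and the 64×64 size are illustrative and not taken from the paper.

```python
import numpy as np

def pink_noise(height, width, rng, exponent=1.0):
    """Gaussian noise with power spectrum roughly proportional to 1 / f**exponent."""
    white = rng.standard_normal((height, width))
    spectrum = np.fft.fft2(white)
    # Radial frequency of every FFT bin.
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = radius[0, 1]  # avoid dividing by zero at the DC bin
    # Dividing amplitudes by f**(exponent/2) divides power by f**exponent.
    shaped = spectrum / radius ** (exponent / 2)
    noise = np.real(np.fft.ifft2(shaped))
    # Renormalize so the result can stand in for a unit-variance latent.
    return (noise - noise.mean()) / noise.std()

sample = pink_noise(64, 64, np.random.default_rng(0))
```

The renormalization keeps the overall variance at one, so only the distribution of energy across frequencies changes relative to white noise.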
- Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers
-
This paper proposes the Just-in-Time (JiT) framework, which dynamically selects sparse anchor tokens in the spatial domain to drive the evolution of the generation ODE. By designing a deterministic micro-flow to ensure seamless activation of new tokens, it achieves up to 7× acceleration on FLUX.1-dev with almost no loss in quality.
- Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
-
By introducing an additional scalar "editing strength" input to an instruction-based image editing model (Flux Kontext), this work utilizes a lightweight projection network to map the combination of strength and instruction into offsets within the DiT modulation space. This enables any edit to transition smoothly from "no change" to "full edit" without requiring separate training for each individual attribute.
- LacTokGen: Latent Consistency Tokenizer for 1024-pixel Image Generation by 256 Tokens
-
The authors propose the LacTok tokenizer, which aligns discrete visual tokens with the compact latent space of a pretrained LDM. By utilizing consistency models to compress LDM decoder multi-step sampling into 1-2 steps for pixel-level supervision, it reconstructs or generates 1024×1024 images using only 256 tokens (16× more compression than VQGAN). An autoregressive transformer is then integrated to form the text-to-image model, LacTokGen.
- Language-Free Generative Editing from One Visual Example
-
This paper reveals that text-guided diffusion models suffer from severe text-visual alignment failures regarding simple visual transformations such as rain, fog, and blur. It proposes the VDC framework, which learns pure visual conditioning signals to guide diffusion editing using only a single pair of visual examples (before and after transformation). The method requires no text and no training, surpassing both text-based and fine-tuning methods in tasks like deraining, dehazing, and denoising.
- LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
-
LaRP transforms a pre-trained 2D diffusion inpainting model into an "innately 3D-aware" multi-view inpainter. It clones a UNet encoder to process clean reference views and reprojects reference features onto the target view coordinates using camera poses estimated by a 3D foundation model. These features are injected into the decoder to guarantee 3D consistency at the source. The resulting images allow for NeRF training using only standard reconstruction loss, achieving quality comparable to SOTA while being approximately 50× faster.
- Latent Diffusion Inversion Requires Understanding the Latent Space
-
This paper identifies that memorization in Latent Diffusion Models (LDM) is spatially non-uniform within the latent space—samples or dimensions where the VAE decoder's pullback metric exhibits larger local distortion are memorized more strongly. Accordingly, a filtering method is proposed that ranks dimensions and masks "low-memorization" ones based solely on VAE geometry. This approach consistently improves AUROC by 1–4% and TPR@1%FPR by 1–32% across six datasets and four types of Membership Inference Attacks (MIA).
- Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection
-
It is observed that real images exhibit stable inter-layer transitions in the intermediate representations of a frozen CLIP ViT, whereas synthetic images show significant attention mutations. The Layer Transition Discrepancy (LTD) method is proposed to model this difference, achieving a mean Acc of 96.90% on UFD, 99.54% on DRCT-2M, and 91.62% on GenImage, outperforming current SOTAs.
- LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
-
The paper proposes LeapAlign, which shortens long generation paths into two-step jump trajectories. This enables reward gradients to backpropagate directly to early generation steps. Combined with trajectory similarity weighting and gradient discounting strategies, it achieves efficient post-training alignment for flow matching models.
- Learnability-Guided Diffusion for Dataset Distillation
-
Proposes LGD, a learnability-driven incremental dataset distillation framework that constructs the distilled dataset in stages. Each stage generates training samples that are complementary rather than redundant by conditioning on the current model state. By injecting learnability gradient guidance during diffusion sampling, it reduces inter-sample information redundancy (which is 80-90% in existing methods) by 39.1%. It achieves 60.1% accuracy on ImageNet-1K (50 IPC) and 87.2% on ImageNette (100 IPC).
- Learning Latent Proxies for Controllable Single-Image Relighting
-
LightCtrl is a diffusion-based framework for single-image relighting. It utilizes a few-shot latent proxy encoder to provide lightweight material-geometry priors, a lighting-aware mask to guide spatially selective denoising, and DPO post-training to enhance physical consistency. This enables precise, continuous control over lighting direction, intensity, and color temperature, outperforming existing methods on both synthetic and real-world scenes.
- Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal
-
This paper proposes the VeilGen + DeVeiler framework, which employs a physics-guided Stable Diffusion generative model to learn latent transmission and glare maps for synthesizing realistic compound degradation training data. By training a restoration network with reversible constraints, it achieves joint removal of aberrations and veiling glare in simplified optical systems.
- Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
-
This paper proposes GvU, which leverages the visual understanding branch of a Unified Multimodal Model (UMM) as an intrinsic reward signal. By constructing a self-supervised RL framework (based on GRPO) through token-level text-image alignment probabilities, it iteratively improves T2I generation quality without external supervision. It achieves a 43.3% improvement on GenEval++, and the enhanced generation in turn promotes fine-grained understanding.
- Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
-
BPGO introduces a "semantic prior anchor" to the GRPO post-training for visual generation. It utilizes the deviation between observed rewards and the prior as an uncertainty signal to perform Bayesian trust allocation across groups (amplifying reliable groups, suppressing ambiguous ones) and prior-anchored reward renormalization within groups (expanding confident deviations, compressing ambiguous scores). It achieves faster convergence and stronger semantic alignment than standard GRPO and DanceGRPO in text-to-image, text-to-video, and image-to-video generation.
- LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
-
The LESA framework is proposed, utilizing Kolmogorov-Arnold Networks (KAN) as learnable temporal predictors. By combining a multi-stage multi-expert architecture with a two-stage training strategy, it achieves \(5\times\) acceleration on FLUX with only a \(1.0\%\) quality degradation. On Qwen-Image, it achieves \(6.25\times\) acceleration with a \(20.2\%\) quality improvement over TaylorSeer, and on HunyuanVideo, it yields a \(24.7\%\) PSNR improvement at \(5\times\) acceleration.
- Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
-
A unified end-to-end color correction framework is proposed to jointly fuse data from a high-resolution RGB sensor and an auxiliary low-resolution multispectral (MS) sensor. By integrating illuminant estimation, illuminant compensation, and color space conversion into a single model, color error (\(\Delta E_{00}\)) is reduced by up to 50% compared to RGB-only and MS-only baselines.
- Leveraging Verifier-Based Reinforcement Learning in Image Editing
-
Edit-R1 proposes a "verifier-style reasoning reward model" (RRM) to replace coarse global scoring in image editing. It decomposes editing instructions into verifiable principles (Keep/Follow/Quality), uses Chain-of-Thought (CoT) for point-by-point verification, and aggregates them into fine-grained scores. A new RL algorithm, GCPO, is introduced to optimize "point-wise reasoning rewards" using paired preference data, boosting the 7B RRM to 82.2% preference prediction accuracy. Finally, this RRM serves as the reward signal for GRPO to optimize editing models like FLUX.Kontext and Qwen-Image-Edit, delivering consistent quality improvements.
- Linear Image Generation by Synthesizing Exposure Brackets
-
Addressing the limitation that existing generative models only produce ISP-compressed sRGB display images that lack editing flexibility, this paper proposes the task of text-to-linear image generation. A high-dynamic-range linear image is split into four "exposure brackets" with different exposure levels. A Flux-based flow matching DiT concurrently generates the bracket sequence and irradiance scale, which are then fused into a scene-referred linear image, outperforming various modified baselines with an FID of 28.29.
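The bracket representation can be illustrated with a toy split-and-fuse round trip. This is a minimal sketch under stated assumptions: the four gains, the saturation threshold `sat`, and the helper names `split_brackets` and `fuse_brackets` are hypothetical, and the separately generated irradiance scale mentioned in the summary is left out.

```python
import numpy as np

def split_brackets(linear, gains):
    """Scale a linear image by each exposure gain and clip to the display range [0, 1]."""
    return [np.clip(linear * g, 0.0, 1.0) for g in gains]

def fuse_brackets(brackets, gains, sat=0.999):
    """Average the unsaturated pixels of each bracket after undoing its gain."""
    num = np.zeros_like(brackets[0])
    den = np.zeros_like(brackets[0])
    for b, g in zip(brackets, gains):
        valid = (b < sat).astype(b.dtype)  # ignore clipped (saturated) pixels
        num += valid * b / g
        den += valid
    return num / np.maximum(den, 1.0)

rng = np.random.default_rng(0)
linear = rng.uniform(0.0, 50.0, size=(8, 8))  # synthetic scene-referred values
gains = [1.0, 1 / 4, 1 / 16, 1 / 64]          # four hypothetical exposure gains
brackets = split_brackets(linear, gains)
recovered = fuse_brackets(brackets, gains)
```

Each bracket fits in a display-like range, yet together they retain the full dynamic range, because every pixel is unsaturated in at least the darkest bracket.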
- LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models
-
LoFA uses a hypernetwork to directly predict "full, uncompressed" personalized LoRA weights within seconds. It first identifies structured "response map" patterns in the changes of LoRA relative to base model weights. Then, it utilizes a two-stage hypernetwork to predict these response maps first, followed by utilizing them to guide the prediction of final LoRA weights. This allows it to meet or even exceed the performance of traditional LoRA, which requires hours of per-instance fine-tuning, across various conditions such as text, pose, style, and faces.
- Low-Rank Residual Diffusion Models
-
LRDM identifies that in "near-domain image restoration" (tasks where source and target domains are already highly similar, such as deraining, deblurring, or deshadowing), degradation residuals are inherently low-rank. Consequently, it constrains the forward diffusion process within a low-rank residual subspace while maintaining the reverse process as full-rank. By adaptively adjusting the rank across time steps, the model theoretically tightens the variational lower bound and achieves superior restoration fidelity with fewer sampling steps.
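To make "low-rank residual" concrete, the sketch below builds a synthetic residual with rank-2 structure and measures how much of its energy a rank-2 truncated SVD retains. Truncated SVD is used here only as a stand-in for illustration; it is not the paper's forward-process construction, and the matrix sizes, noise level, and the name `truncate_rank` are assumptions.

```python
import numpy as np

def truncate_rank(residual, rank):
    """Best rank-`rank` approximation of a 2D residual (Eckart-Young, via SVD)."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
# Synthetic stand-in for a degradation residual: rank-2 structure plus faint noise.
residual = rng.standard_normal((32, 2)) @ rng.standard_normal((2, 32))
residual += 1e-3 * rng.standard_normal((32, 32))

approx = truncate_rank(residual, rank=2)
# Fraction of squared energy captured by the rank-2 subspace.
energy_kept = np.sum(approx ** 2) / np.sum(residual ** 2)
```

When a residual behaves like this, a small subspace carries nearly all of the signal, which is the property the summary says LRDM exploits.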
- Low-Resolution Editing is All You Need for High-Resolution Editing
-
ScaleEdit introduces the first formalization of high-resolution image editing. It achieves high-quality editing at 2K or even 8K resolution via test-time optimization: it learns a \(1 \times 1\) convolutional transfer function in the intermediate feature space of pre-trained generative models to inject fine-grained textures from the source image, and uses a Blended-Tweedie-based patch synchronization strategy to ensure global consistency.
- LumiX: Structured and Coherent Text-to-Intrinsic Generation
-
LumiX proposes the new task of "text-to-intrinsic" generation based on the FLUX diffusion model: generating a set of pixel-aligned intrinsic maps (color, albedo, irradiance, depth, normals) from a single text prompt. It achieves this through two key designs: Query-Broadcast Attention, which broadcasts the color branch query to all intrinsic maps to ensure structural consistency, and Tensor LoRA, which efficiently models cross-map relationships via tensor decomposition. LumiX achieves a 23% higher alignment score than the SOTA and improves preference scores from -0.41 to 0.19, while the same framework can be reversed for image-conditioned intrinsic decomposition.
- MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement
-
Addressing the pain point that real-world scenarios often possess visible cameras but lack infrared cameras, this paper proposes the "Single Image Fusion (SIF)" paradigm. Two diffusion streams are utilized to reinforce intra-spectral knowledge and generate infrared knowledge from a single low-quality visible image. These are fused at the noise level to obtain "MagImg," which balances human perception and downstream semantic decision-making. Using only a single degraded visible image, visual/semantic metrics achieve performance comparable to or exceeding SOTA fusion methods that require paired infrared-visible inputs.
- MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models
-
This work proposes MapReduce LoRA and RaTE, two complementary methods to advance the Pareto front in multi-preference optimization: the former pushes the Pareto front progressively via a "Map (parallel training of preference experts) + Reduce (iterative merging)" strategy; the latter enables composable preference control during inference by learning reward-aware token embeddings.
- MapRoute: Semantic Routing for Precise Concept Erasure with Mapper
-
MapRoute inserts a set of lightweight "Mapper" modules after a frozen text encoder. Each Mapper learns a "conditional identity mapping" through two-stage training (mapping the target concept embedding to a surrogate concept while maintaining identity for others). During inference, a top-K semantic router dynamically selects and serially applies relevant Mappers based on the input prompt. This achieves thorough erasure of specified concepts with minimal damage to unrelated concepts, outperforming SOTA methods like MACE and UCE across objects, celebrities, artistic styles, and mixed concepts.
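The routing step can be sketched as a cosine-similarity lookup followed by serial application. Everything here is a hypothetical illustration rather than the paper's design: the concept keys, the toy mapper functions, and the names `select_mappers` and `apply_serially` are assumptions, and real Mappers would be trained modules operating on text-encoder outputs.

```python
import numpy as np

def select_mappers(prompt_emb, concept_keys, top_k):
    """Return indices of the top_k concept keys most similar to the prompt embedding."""
    p = prompt_emb / np.linalg.norm(prompt_emb)
    k = concept_keys / np.linalg.norm(concept_keys, axis=1, keepdims=True)
    scores = k @ p  # cosine similarity per concept
    return [int(i) for i in np.argsort(scores)[::-1][:top_k]]

def apply_serially(embedding, mappers, indices):
    """Apply the selected mappers one after another, in routing order."""
    for i in indices:
        embedding = mappers[i](embedding)
    return embedding

concept_keys = np.eye(3)                # one hypothetical key per erased concept
prompt_emb = np.array([0.9, 0.4, 0.1])  # hypothetical prompt embedding
chosen = select_mappers(prompt_emb, concept_keys, top_k=2)

# Toy stand-ins for trained Mapper modules.
mappers = [lambda e: e + 1.0, lambda e: e * 2.0, lambda e: e * 0.0]
routed = apply_serially(np.zeros(2), mappers, chosen)
```

Because only the selected Mappers run, prompts unrelated to any erased concept would ideally pass through modules that act as identity mappings, per the summary.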
- Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
-
Reformulates the next-scale prediction of the Visual Autoregressive (VAR) model, which depends on the full context, as Markovian scale prediction. A sliding-window history compensation mechanism enables non-full-context modeling, reducing FID by 10.5% and peak memory by 83.8% on ImageNet.
- MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation
-
MaskFocus introduces a reinforcement learning post-training framework for Masked Generative Models (MGM). It identifies a few critical sampling steps for image formation using "cosine similarity changes between intermediate and final image embeddings," performing policy optimization only on these steps to avoid the high cost of full-trajectory estimation. Additionally, it employs "entropy-based dynamic routing sampling" to divert high/low entropy samples, balancing exploration and exploitation. This pushes the GenEval score of the open-source MGM Meissonic from 0.54 to 0.76, approaching FLUX across multiple metrics.
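The critical-step criterion can be sketched as follows. This is an assumption-laden illustration: the embeddings would come from an image encoder applied to intermediate decodes, and ranking steps by the absolute change in cosine similarity is one plausible reading of the summary, not a confirmed detail of the method. The name `critical_steps` and the toy 2D embeddings are hypothetical.

```python
import numpy as np

def critical_steps(step_embeddings, top_k):
    """Indices i whose transition i -> i+1 changes similarity to the final image most."""
    final = step_embeddings[-1]
    unit = step_embeddings / np.linalg.norm(step_embeddings, axis=1, keepdims=True)
    sims = unit @ (final / np.linalg.norm(final))  # cosine similarity to final embedding
    changes = np.abs(np.diff(sims))                # change produced by each transition
    return np.sort(np.argsort(changes)[::-1][:top_k])

a, b, mix = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
# Toy trajectory: the embedding jumps at transitions 1 -> 2 and 3 -> 4.
trajectory = np.array([a, a, mix, mix, b, b])
selected = critical_steps(trajectory, top_k=2)
```

Policy optimization would then be restricted to the selected steps, which is how the summary describes avoiding full-trajectory estimation.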
- Match-and-Fuse: Consistent Generation from Unstructured Image Sets
-
Match-and-Fuse is proposed as the first training-free consistent generation method for unstructured image sets. By constructing a pairwise consistency graph with images as nodes and image pairs as edges, it manipulates internal features during diffusion inference through Multi-view Feature Fusion (MFF) and feature guidance to achieve set-level cross-image consistency. It achieves a DINO-MatchSim of 0.80, significantly outperforming all baselines.
- MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis
-
MatPedia encodes "RGB texture + four PBR maps" into a 5-frame sequence and applies a video diffusion architecture for joint modeling. This unified model handles text-to-material, image-to-material, and intrinsic decomposition tasks. By leveraging massive pure RGB images during training, it outperforms previous specialized methods at a native 1024×1024 resolution.
- MeanFlow Transformers with Representation Autoencoders
-
MeanFlow-RAE migrates the few-step generation model MeanFlow from the traditional SD-VAE latent space to the semantic latent space of a "Representation Autoencoder (RAE)". It utilizes Consistency Mid-training (CMT) for trajectory-aware initialization to prevent gradient explosion, replaces training-from-scratch with Flow Matching Distillation (MFD), and substitutes JVP with finite differences. Ultimately, it achieves an ImageNet 256 single-step FID of 2.03 (vs. 3.43 for vanilla MF) while reducing sampling GFLOPs by 38% and total training costs by approximately 83%.
- Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping
-
The DiT-BlockSkip framework reduces LoRA fine-tuning VRAM on FLUX by approximately 50% while maintaining personalized generation quality comparable to standard LoRA. It combines timestep-aware dynamic patch sampling (low-resolution training with dynamically adjusted cropping ranges) with a block skipping strategy that uses cross-attention analysis to select key blocks and pre-computes residual features.
- MERIT: Multi-domain Efficient RAW Image Translation
-
MERIT is the first unified framework to achieve multi-camera RAW-to-RAW translation using a single model. By conditioning on style embeddings, it enables translation from any source domain to any target domain. It explicitly aligns Poisson-Gaussian noise statistics through sensor-aware noise modeling and enhances RAW feature representation with multi-scale large-kernel attention. The authors also release MDRAW, the first multi-domain RAW benchmark. MERIT outperforms previous methods in both image quality (\(+5.56\) dB PSNR) and scalability (approx. 80% reduction in training iterations).
- Meta-CoT: Enhancing Granularity and Generalization in Image Editing
-
Addressing the dilemma where CoT in unified multimodal models for image editing is either "too vague" or "too specialized," Meta-CoT explicitly decomposes any single-image editing task into a triplet of "(Task, Target, Required Understanding)." It further breaks tasks down into 5 combinable "meta-task" bases and employs a "CoT-Editing Consistency" reward for RL alignment. This achieves a 15.8% overall improvement on a 21-category editing benchmark compared to a non-CoT baseline with the same data/parameters, generalizing to numerous unseen tasks by training on only 5 meta-tasks.
- MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
-
To address the lack of high-quality training data for Multi-Image Composition (MICo)—the task of synthesizing people, objects, clothing, and scenes from multiple reference images into a single coherent image—this work constructs the MICo-150K dataset (containing 150,000 identity-consistent samples) and the MICo-Bench. The construction utilizes the proprietary Nano-Banana model combined with a Compose-by-Retrieval prompt strategy, human-in-the-loop filtering, and a "Decompose-and-Recompose" (De&Re) workflow. Furthermore, a Weighted-Ref-VIEScore metric is proposed. Fine-tuning multiple open-source T2I models on this dataset significantly enhances their MICo capabilities, approaching the performance of closed-source models.
- MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models
-
This paper introduces MICON-Bench, a multi-image context generation benchmark covering 6 tasks (1043 cases) paired with an MLLM-driven Evaluation-by-Checkpoint automated framework. Simultaneously, it proposes DAR (Dynamic Attention Rebalancing), a training-free mechanism that enhances multi-image consistency and generation quality in UMMs by dynamically adjusting inference-time attention weights.
- Mirai: Autoregressive Visual Generation Needs Foresight
-
Autoregressive (AR) image generators model sequences token-by-token causally, looking only at the "next token," which leads to disordered global structures and slow convergence. This paper proposes Mirai, which introduces an additional "foresight" signal during training. It aligns the intermediate layer representations of the AR model on a 2D grid with the representations of future tokens (either explicit foresight Mirai-E from EMA or implicit foresight Mirai-I from a frozen bidirectional DINOv2 encoder). Without altering the architecture or increasing inference overhead, it accelerates the convergence of LlamaGen-B by up to 10× and reduces the FID from 5.34 to 4.34.
- Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
-
Proposes Mixture of States (MoS)—a multimodal fusion paradigm based on learnable token-level sparse routing, enabling visual tokens to adaptively select hidden states from any layer of the text encoder at each denoising step. This allows 3-5B parameter models to match or exceed the performance of 20B-class models.
- MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
-
MMFace-DiT utilizes a dual-stream DiT that processes a "text semantic stream" and a "mask/sketch spatial stream" in parallel and on equal footing within the same Transformer. Through layer-wise deep fusion using shared RoPE attention and a Modality Embedder that allows switching between mask/sketch conditions without retraining, the model improves FID and other metrics by approximately 40% compared to 6 SOTA methods in both text+mask and text+sketch controllable face generation.
- MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
-
To address the issue in diffusion-based human motion generation where "semantics, style, and history are entangled in a single conditional pathway, leading to long-sequence drift and loss of style control," MoCoDiff uses three lightweight "Injection Modulation Controllers (IMC)" to separately inject text, style, and history into a frozen backbone. By treating history as a "time-varying correction signal that directly rewrites diffusion transition dynamics," a Temporal IMC drives controlled autoregressive diffusion. This achieves the highest style accuracy, the lowest jitter, and inference speedups ranging from approximately 4.8× to an order of magnitude for long-sequence stylized motions.
- MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
-
MorphAny3D is proposed as the first training-free 3D morphing framework based on Structured Latent (SLAT) representation. It achieves SOTA quality in cross-category 3D morphing by integrating source/target information via Morphing Cross-Attention (MCA) to ensure structural plausibility, enhancing temporal consistency with Temporal-Fused Self-Attention (TFSA), and eliminating abrupt transitions through an orientation correction strategy.
- MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification
-
The MOS framework is proposed to address the Optical-SAR modality gap in ship re-identification via two core modules: (1) MCRL, which narrows the gap during training through SAR denoising and category-level modality alignment loss; (2) CDGF, which utilizes a Brownian Bridge Diffusion Model during inference to generate pseudo-SAR samples from optical images for feature fusion. On the HOSS ReID dataset, it achieves a +16.4% R1 improvement in the SAR→Optical task.
- MPDiT: Multi-Patch Global-to-Local Transformer Architecture for Efficient Flow Matching
-
MPDiT is proposed as a multi-scale patch global-to-local Diffusion Transformer architecture. It utilizes large patches (\(4 \times 4\)) in the early stages to process global context with only 64 tokens, followed by upsampling to small patches (\(2 \times 2\)) with 256 tokens in the later stages to refine local details. This design reduces GFLOPs by up to 50%, with the XL model achieving an FID of 2.05 (cfg) within 240 epochs.
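The token counts quoted above follow from simple patch arithmetic. The sketch assumes a \(32 \times 32\) latent grid, which is an assumption chosen because it reproduces the 64 and 256 figures; the latent size is not stated in the summary, and the quadratic attention-cost comparison is a generic property of self-attention rather than a number from the paper.

```python
def num_tokens(latent_size, patch_size):
    """Number of tokens when a square latent is cut into square patches."""
    return (latent_size // patch_size) ** 2

LATENT_SIZE = 32  # assumption: 32 x 32 latent grid

coarse = num_tokens(LATENT_SIZE, 4)  # large patches in the early (global) stage
fine = num_tokens(LATENT_SIZE, 2)    # small patches in the later (local) stage

# Self-attention cost grows quadratically with token count, so the coarse stage
# is far cheaper per layer than the fine stage.
attention_cost_ratio = (fine / coarse) ** 2
```

Under these assumptions the coarse stage has 64 tokens and the fine stage 256, so each coarse-stage attention layer costs roughly one sixteenth of a fine-stage one.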
- MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale
-
MRT unifies three layered image tasks—Text-to-Layer (T2L), Image-to-Layer (I2L), and Layer-to-Layer (L2L)—into a single 20B masked region diffusion Transformer. It utilizes "adaptive masking" to determine whether each layer originates from clean latents or noise, and incorporates an "overflow-aware canvas layer" to generate full, reusable RGBA layers that extend beyond canvas boundaries. Trained on 10M design samples, its layering quality surpasses ART and the concurrent Qwen-Image-Layered, with \(10\sim100\times\) faster inference and \(50\%\sim90\%\) lower activation memory.
- Multi-Scale Local Speculative Decoding for Image Generation
-
MuLo-SD introduces a multi-scale approach—"low-resolution draft + upsampling + high-resolution parallel verification"—into speculative decoding. By replacing traditional raster-scan full-sequence backtracking with "local neighborhood resampling of rejected tokens" combined with parallel decoding, it achieves up to \(5.33\times\) end-to-end acceleration for autoregressive image generation on Tar-1.5B/7B while maintaining semantic alignment and perceptual quality.
- MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
-
This paper proposes MultiBanana, the first large-scale benchmark to systematically evaluate multi-reference image generation capabilities. It comprises 3,769 evaluation samples with up to 8 reference images across 5 difficulty dimensions (including cross-domain, scale, rare concepts, and multilingualism), revealing a complementary failure mode where closed-source models "overfit reference details" while open-source models "ignore reference subjects."
- MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
-
MultiCrafter decomposes "multi-subject customized generation" into two non-conflicting training phases: pre-training uses explicit positional supervision to constrain each subject's attention to correct spatial regions to eliminate attribute crosstalk and employs MoE-LoRA for complex layout capacity; post-training utilizes an online reinforcement learning framework with Hungarian matching for scoring to maximize aesthetic and text alignment, significantly outperforming current In-Context-Learning (ICL) methods in subject fidelity.
- NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers
-
NAMI partitions the rectified flow of text-to-image generation into multiple time windows based on resolution. Low-resolution stages utilize fewer Transformer layers to rapidly construct layouts, while high-resolution stages gradually stack layers for detail refinement. A learnable BridgeFlow module aligns distributions between adjacent stages. At a 2B parameter scale, it reduces inference time for \(1024 \times 1024\) images by 64% while maintaining quality comparable to state-of-the-art models.
- Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
-
This paper proposes NLCE, a training-free three-stage concept erasure framework. It achieves precise localized erasure of target concepts while explicitly preserving semantically proximal concepts through spectrally weighted representation modulation, attention-guided spatial gating, and gated feature scrubbing. NLCE outperforms existing methods on Oxford Flowers, Stanford Dogs, celebrity identities, and sensitive content erasure tasks.
- Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
-
This paper reinterprets SDE-based GRPO as distance optimization/contrastive learning and proposes Neighbor GRPO. It completely bypasses SDE conversion by constructing neighborhood candidate trajectories through perturbed ODE initial noise and implements policy gradient optimization via a softmax distance proxy policy, thereby preserving all advantages of deterministic ODE sampling.
- Not All Birds Look The Same: Identity-Preserving Generation For Birds
-
Addressing the lack of "multi-view images of the same individual" for fine-grained birds, this paper constructs a benchmark (NABLA) of 4,759 "look-alike" bird pairs using expert annotations from NABirds. It proposes using "same species / age / sex / breeding stage" as identity proxies to train controllable diffusion models like OminiControl and Insert Anything, achieving an approximately 41% reduction in MSE compared to baselines and demonstrating generalization to unseen species.
- NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
-
NOVA is proposed, formalizing the "Sparse Control, Dense Synthesis" paradigm for video editing for the first time: the sparse branch provides semantic guidance from multiple user-edited keyframes, while the dense branch injects motion and texture information from the original video. Combined with a degradation simulation training strategy, it enables learning without paired data, significantly outperforming existing methods in editing fidelity, motion preservation, and temporal consistency.
- Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
-
This paper proposes Object-WIPER, the first training-free framework for removing video objects and their associated effects (shadows, reflections, mirrors, etc.). It leverages text-visual cross-attention and visual self-attention in DiT to localize associated effect regions. Clean removal is achieved through foreground re-initialization and attention scaling. The paper also introduces the TokSim metric and the WIPER-Bench real-world benchmark.
- OctoT2I: A Self-Evolving Agentic Text-to-Image Router
-
OctoT2I reframes the selection of text-to-image models for a given prompt as a constrained optimization problem: choosing the tool with the minimum cost while satisfying a quality threshold. By employing a multi-turn agentic router supported by a zero-human-annotation, self-built tool knowledge base (PSEL self-evolving loop), the method achieves an overall score of 0.96 on GenEval. Compared to the strong baseline Flow-GRPO, it achieves a 90.3% speedup and a 56.6% improvement in energy efficiency.
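The constrained selection described for OctoT2I (minimum cost subject to a quality threshold) can be sketched as follows. `Tool`, `route`, the predicted-quality field, and the fallback to the highest-quality tool when nothing meets the threshold are all illustrative assumptions. The summary does not describe how the actual router estimates quality or handles infeasible cases.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tool:
    name: str                 # hypothetical identifier of a T2I model
    cost: float               # e.g. latency or energy per image (assumed unit)
    predicted_quality: float  # router's quality estimate for this prompt


def route(tools: List[Tool], threshold: float) -> Optional[Tool]:
    """Pick the cheapest tool whose predicted quality meets the threshold."""
    if not tools:
        return None
    feasible = [t for t in tools if t.predicted_quality >= threshold]
    if feasible:
        return min(feasible, key=lambda t: t.cost)
    # Assumed fallback: no tool is good enough, so return the best available.
    return max(tools, key=lambda t: t.predicted_quality)


tools = [
    Tool("small", cost=1.0, predicted_quality=0.70),
    Tool("medium", cost=3.0, predicted_quality=0.85),
    Tool("large", cost=10.0, predicted_quality=0.95),
]
choice = route(tools, threshold=0.8)
```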
- Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
-
Holistic embeddings extracted by generic image encoders (CLIP/DINOv2/VAE) in existing personalization methods are "entangled" and often carry over irrelevant information such as lighting and clothing (copy-and-paste artifacts). Omni-Attribute instead feeds the encoder both an image and a textual attribute description, so that it learns open-vocabulary embeddings for the designated attribute only (identity, expression, lighting, style, etc.). Trained on positive/negative attribute-paired data with a dual objective of generative and contrastive losses, it achieves SOTA results in attribute retrieval, personalization, and multi-attribute composition.
- Omni2Sound: Towards Unified Video-Text-to-Audio Generation
-
This paper aims to train a single model to simultaneously excel in video-to-audio (V2A), text-to-audio (T2A), and video-text-to-audio (VT2A). The research identifies two primary hurdles: the scarcity of high-quality V-A-T aligned captions and the competition between/within tasks. To address these, the authors developed SoundAtlas, a dataset of 470k pairs of tightly aligned captions generated via an agent-based labeling pipeline. This is combined with Omni2Sound, a decoupled dual-branch DiT model with a three-stage progressive training strategy, achieving SOTA performance across all three tasks using a standard DiT backbone.
- Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models
-
Omni IIE Bench is a high-quality human-annotated benchmark specifically designed to diagnose the "consistency of instruction-based image editing models across semantic scales." Using a dual-track design consisting of "single-turn consistency + multi-turn coordination (up to 16 turns)," 2856 samples were constructed from 12 data sources through a three-stage process (auto-generation → auto-masking → multi-pass rigorous human review). A decoupled evaluation framework (global quality + foreground/background fidelity + instruction compliance) was proposed, quantifying for the first time a universal failure mode: almost all mainstream editing models suffer significant performance degradation when switching from low to high semantic scales, which further collapses in multi-turn scenarios due to error accumulation.
- OmniGen2: Towards Instruction-Aligned Multimodal Generation
-
OmniGen2 adopts a unified "decoupled VLM + Diffusion" architecture (where VLM handles understanding and Diffusion handles generation, conditioned on VLM variable-length hidden states and VAE features). By combining Omni-RoPE position encoding with a two-stage training strategy—"building a strong base followed by progressive RL alignment"—the model precisely follows complex instructions across text-to-image, image editing, and in-context generation tasks, achieving a GenEval score of 0.95.
- One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
-
This paper proposes ELIT (Elastic Latent Interface Transformer), which inserts a variable-length latent interface and lightweight Read/Write cross-attention layers into DiT. This enables a single model to dynamically adjust the computational budget during inference while non-uniformly allocating computation to harder regions of an image, achieving up to a 53% reduction in FID on ImageNet 512px.
- OneHOI: Unifying Human-Object Interaction Generation and Editing
-
OneHOI uses a Diffusion Transformer (R-DiT) to unify "HOI image generation" and "HOI image editing" into a single conditional denoising process. By explicitly modeling interaction structures through an HOI encoder, verb-mediated structured attention, and HOI-specific RoPE, it achieves SOTA results in editing, layout-controllable generation, and the newly proposed multi-HOI editing task.
- OntoAug: Rethinking Generative Data Augmentation via Ontology Guidance
-
OntoAug explicitly decomposes an image into the "ontology part" (foreground subject) and the "incidental part" (background). It uses the foreground mask as a hard constraint for diffusion inpainting to modify only the background while keeping the subject unchanged. Combined with geometric layout transformations and a background vocabulary expanded by LVLM/LLMs, it simultaneously achieves "subject stability, background diversity, and overall coordination," reaching SOTA performance on fine-grained classification, few-shot learning, WSOL, and VLM reinforcement fine-tuning.
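The "foreground mask as a hard constraint" idea can be illustrated with a pixel-level composite: wherever the mask marks the subject, the original pixel is kept, and everywhere else the newly generated background is used. This is a simplified sketch on nested lists. Actual diffusion inpainting typically enforces the mask during sampling rather than as a final paste, and `composite` is a hypothetical helper.

```python
from typing import List

Image = List[List[float]]  # toy single-channel image as rows of pixel values
Mask = List[List[int]]     # 1 = foreground subject (keep), 0 = background


def composite(original: Image, generated: Image, mask: Mask) -> Image:
    """Keep original pixels under the foreground mask, generated ones elsewhere."""
    return [
        [o if m else g for o, g, m in zip(row_o, row_g, row_m)]
        for row_o, row_g, row_m in zip(original, generated, mask)
    ]


original = [[0.1, 0.2], [0.3, 0.4]]
generated = [[0.9, 0.8], [0.7, 0.6]]
mask = [[1, 0], [0, 1]]  # subject occupies the diagonal
augmented = composite(original, generated, mask)
```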
- OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery
-
OpenDPR proposes a training-free vision-centric framework that leverages diffusion models to offline generate diverse visual prototypes for target categories. During inference, it identifies open-vocabulary changes in remote sensing images through similarity retrieval in visual space, achieving SOTA performance on four benchmark datasets.
- OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation
-
This paper proposes OPRO, a parameter-efficient adaptation method based on orthogonal matrices. By imposing learnable panel-specific orthogonal operators on the position-aware query/key of a frozen backbone, it explicitly modulates inter-panel attention interactions while preserving pre-trained intra-panel synthesis behavior. With only 0.93M additional parameters, OPRO significantly enhances the editing quality of several SOTA methods on MagicBrush.
- OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
-
OrthoFuse is the first training-free fusion method for multiplicative orthogonal adapters (OFT). It treats Group-and-Shuffle (GS) orthogonal matrices as points on a Riemannian manifold, synthesizes a "concept adapter" and a "style adapter" into one via block-level geodesic interpolation, and applies a spectral recovery transform to restore eigenvalues flattened by interpolation. This allows merging a specified subject with a specific artistic style into a single image without retraining.
- OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
-
OSPO enables a unified Multimodal Large Language Model (Unified MLLM) to self-generate preference image pairs that share global semantics but differ in object details. By utilizing object masks derived from attention to weight the SimPO loss, it significantly enhances fine-grained object-level alignment and suppresses object hallucinations in T2I generation without relying on external data or models.
- Parallel Jacobi Decoding for Fast Autoregressive Image Generation
-
Addressing the "token-by-token serial, extremely slow inference" bottleneck in autoregressive (AR) image generation, this paper proposes a training-free Parallel Jacobi Decoding (PJD). It transforms the 1D Jacobi draft into a 2D "row-parallel" expansion along the image grid, utilizing a row-causal attention mask to suppress error accumulation. It achieves 4.8×–6.4× speedup on Lumina-mGPT / LlamaGen with negligible impact on image quality.
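For context, the 1D Jacobi decoding that PJD starts from can be sketched as a fixed-point iteration: all draft positions are refreshed in parallel from the previous draft until nothing changes. The sketch covers only this 1D baseline, not the paper's 2D row-parallel extension or its row-causal mask. `next_token` is a hypothetical deterministic stand-in for a model's greedy prediction.

```python
from typing import List, Tuple


def next_token(prefix: List[int]) -> int:
    """Hypothetical deterministic 'model': stands in for a greedy argmax."""
    return (sum(prefix) * 3 + len(prefix)) % 11


def sequential_decode(prompt: List[int], n: int) -> List[int]:
    """Ordinary token-by-token decoding, used as the reference result."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt: List[int], n: int) -> Tuple[List[int], int]:
    """Refresh all n draft tokens in parallel until a fixed point is reached."""
    draft = [0] * n  # arbitrary initial guess
    iterations = 0
    while True:
        iterations += 1
        # Each position i is recomputed from the prompt plus the previous
        # draft's first i tokens; on real hardware this is one batched pass.
        new = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new == draft:
            return draft, iterations
        draft = new


tokens, iters = jacobi_decode([1, 2], 8)
```

The fixed point coincides with the sequential result because after k iterations the first k tokens are already correct. Any speedup comes from several tokens converging within a single iteration.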
- ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
-
In "understanding-generation unified" multimodal models, ParaUni shifts from using only the final layer of VLM as the diffusion condition to a parallel integration of all VLM layer visual features through a Layer Integration Module (LIM). During the RL stage, a Layer-wise Dynamic Adjustment Mechanism (LDAM) is employed to specifically perturb different layers based on distinct rewards, thereby enhancing both fine details and semantic alignment. It achieves a GenEval score of 0.87 and a DPG-Bench score of 83.45.
- PhotoFramer: Multi-modal Image Composition Instruction
-
PhotoFramer formulates "how to take better-composed photos" as a unified understanding-generation task: given a poorly composed image, it first explains in natural language how to improve it (e.g., "remove the fence, center the subject"), and then generates an example image of the same scene with good composition, allowing amateur photographers to re-shoot by following both the text and the example.
- PhyCo: Learning Controllable Physical Priors for Generative Motion
-
PhyCo enables video diffusion models to generate motion consistent with physics (friction, restitution, deformation, external forces) in a continuous and controllable manner without relying on any simulators or geometric reconstruction during inference. This is achieved through a triad: a 100k physical simulation dataset, supervised fine-tuning using ControlNet with pixel-aligned physical attribute maps, and differentiable reward optimization using a fine-tuned VLM for physical Q&A scoring. It improves the IQ Score on the Physics-IQ benchmark from a baseline of ~28 to 43.6.
- PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
-
PhysGen is proposed as a unified framework that integrates physical constraints (aerodynamic efficiency) into 3D shape generation. By jointly encoding geometric and physical information into a unified latent space via a Shape-and-Physics VAE, the model iteratively alternates between velocity updates and physical refinement within a Flow Matching framework. This process generates 3D shapes, such as low-drag vehicles, that are both visually realistic and physically efficient.
- Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
-
The paper proposes ReMD (Residual-Multigrid Diffusion), which embeds multigrid residual correction into each reverse sampling step of the diffusion model. By utilizing multi-wavelet bases to construct a cross-scale hierarchical structure, it achieves physics-consistent and efficient fluid super-resolution without requiring explicit PDEs.
- Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
-
The authors utilized Nano-Banana (Gemini-2.5-Flash-Image) to batch-generate approximately 400,000 instruction-based image editing samples on real photos from OpenImages. Using Gemini-2.5-Pro for automated quality inspection, they constructed Pico-Banana-400K, an open-source dataset covering 35 editing types that supports single-turn SFT, preference learning, and multi-turn editing research.
- Pixel Motion Diffusion Is What We Need for Robot Control
-
DAWN proposes a two-stage full diffusion framework: the Motion Director generates dense pixel motion fields as an interpretable intermediate representation, and the Action Expert transforms these into executable robot action sequences. It achieves SOTA on CALVIN (Avg Len 4.00), MetaWorld (Overall 65.4%), and real-world tasks, with model capacity and training data requirements significantly lower than competing methods.
- PixelDiT: Pixel Diffusion Transformers for Image Generation
-
PixelDiT proposes a dual-layer pixel-space diffusion model based entirely on Transformers: a patch-level DiT captures global semantics and a pixel-level DiT refines texture details. Without a VAE, it achieves 1.61 FID on ImageNet and allows direct training of text-to-image models in 1024 resolution pixel space.
- PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
-
PixelRush is the first method to make training-free high-resolution generation practical. By employing a partial inversion strategy that makes few-step diffusion models viable for patch refinement, it generates 4K images in 20 seconds, a \(10 \times\) to \(35 \times\) speedup over existing methods with superior quality.
- Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
-
This work proposes the PPCL framework, which detects contiguous redundant layer intervals in MMDiT using linear probes and implements depth pruning (pluggable) and width pruning (replacing text streams/FFNs with linear projections) via non-sequential distillation. It compresses Qwen-Image from 20B to 10B with only a 3.29% performance drop.
- POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
-
Addressing the conflict between "text accuracy" and "overall image harmony/aesthetics" in visual text generation, POCA reformulates GRPO multi-reward alignment as a multi-objective optimization problem. It utilizes bi-directional Pareto ranking in the joint reward space to select non-dominated (good) and dominated (poor) samples as positive and negative signals. Combined with an adaptive curriculum based on the ECDF of OCR rewards to arrange training data from "easy to difficult," it simultaneously improves Sen.ACC, CLIP, and HPS on the AnyText-benchmark.
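The Pareto selection step can be sketched with a standard dominance test over reward vectors. Reading "bi-directional" as taking the non-dominated front for positives and the front under negated rewards for negatives is an assumption. The function names and reward values are illustrative only.

```python
from typing import List, Sequence

Rewards = Sequence[float]  # e.g. (OCR accuracy, aesthetic score); higher is better


def dominates(a: Rewards, b: Rewards) -> bool:
    """a dominates b if it is no worse everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Rewards]) -> List[int]:
    """Indices of samples that no other sample dominates."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def worst_front(points: List[Rewards]) -> List[int]:
    """Indices on the front of the negated problem (assumed negative set)."""
    negated = [tuple(-x for x in p) for p in points]
    return pareto_front(negated)


rewards = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
positives = pareto_front(rewards)
negatives = worst_front(rewards)
```

In this toy group, three samples trade off text accuracy against aesthetics without being dominated, while the last sample is worse than every other on both rewards.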
- POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling
-
The authors simultaneously collected the largest open-source face OLAT (One-Light-at-a-Time) dataset, POLAR (220 subjects, 156 light directions, 32 views, 16 expressions, 4K), and trained a generative model, POLARNet, based on "latent bridge matching." POLARNet generates single-light responses in various directions directly from a flat-lit portrait in a single step, followed by linear composition for relighting under arbitrary HDR environment lighting, achieving physical consistency and cross-identity generalization.
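The "linear composition" step relies on the linearity of light transport: an image under arbitrary lighting is a weighted sum of the single-light (OLAT) responses. A standard image-based relighting formulation, given here as an assumed sketch since the summary does not state POLARNet's exact weighting, is:

\[
I_{\text{relit}}(p) \;=\; \sum_{i=1}^{N} w_i \, I_i(p), \qquad N = 156,
\]

where \(I_i(p)\) is the generated response at pixel \(p\) under light direction \(i\), and \(w_i\) is the per-channel radiance of the HDR environment map aggregated over the region of the sphere assigned to direction \(i\).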
- PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
-
PortraitDirector reformulates facial reenactment from "driving an entangled holistic motion signal" to a "hierarchical composition task." It disentangles head pose, local expressions (eyes/mouth), and global emotions through spatial, semantic, and composite layers before recombining them. A global emotion filtering module based on the Information Bottleneck principle is introduced to remove residual emotions from local motions. Combined with diffusion distillation, causal attention, and a lightweight VAE, it achieves controllable real-time reenactment at 512×512 resolution, 20 FPS, and 800 ms latency on a single NVIDIA 5090.
- PositionIC: Unified Position and Identity Consistency for Image Customization
-
PositionIC utilizes an automated data synthesis pipeline (BMPDS) to generate multi-subject paired data with positional annotations. It then employs a NeRF-inspired "Visibility-Aware Attention" mechanism to restrict each reference subject's attention range within a specified bounding box. This approach achieves SOTA identity fidelity and spatial controllability for multi-subject customization without introducing additional training parameters or inference overhead.
- PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback
-
PosterOmni decomposes "image-to-poster" generation into six tasks across two categories: local editing (expansion/inpainting/scaling/identity preservation) and global creation (layout/style transfer). It first trains local and global experts, then integrates them into a single student model via task distillation. Finally, a unified reward model and DiffusionNFT reinforcement learning are used to align aesthetics and instructions. This single model outperforms all open-source editing models on the custom PosterOmni-Bench and approaches or exceeds closed-source commercial systems like Seedream-4.0.
- PosterReward: Unlocking Accurate Evaluation for High-Quality Graphic Design Generation
-
PosterReward automatically constructs a 70K poster preference dataset using consensus from multiple MLLMs and employs an "image analysis-driven" four-stage cascaded training. This results in the first reward model specifically designed to evaluate the generation quality of posters and graphic designs, improving accuracy from a baseline of 40%~53% to 86% on both self-built and public preference benchmarks.
- Precise Object and Effect Removal with Adaptive Target-Aware Attention
-
The ObjectClear framework is proposed, which decouples foreground removal from background reconstruction via Adaptive Target-Aware Attention (ATA). Combined with Attention-Guided Fusion (AGF) and Spatially-Varying Denoising Strength (SVDS) strategies, it achieves precise removal of target objects and their incidental effects (shadows, reflections). Additionally, the first large-scale Object-Effect Removal dataset, OBER, is constructed.
- Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation
-
Premier represents each user's preference as a learnable embedding. A preference adapter fuses this embedding with text prompts to output per-token modulation directions injected into the MM-DiT modulation mechanism. A dispersion loss is employed to separate preference directions among different users. To address the cold-start problem for new users with scarce data, the model uses a "linear combination of existing user embeddings," enabling the generation of personalized images based on preference images without any textual preference descriptions.
- Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
-
LivingSwap is proposed as the first video-reference-guided face swapping model. Utilizing a controllable pipeline of keyframe identity injection, source video reference completion, and temporal stitching, it achieves high-fidelity face swapping in long videos. By maintaining source details such as expressions, lighting, and motion while consistently injecting the target identity, it reduces manual editing effort by a factor of 40.
- Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models
-
This work systematically probes affordance capabilities within Vision Foundation Models (VFMs). It discovers that DINO encodes part-level geometric structures while Flux encodes verb-conditioned interaction priors. By fusing both in a training-free manner, the authors achieve zero-shot affordance estimation competitive with weakly supervised methods.
- ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers
-
ProcessMaker implements cross-domain procedural sequence generation on Flux.1 (DiT) using "sparse mask LoRA + self-supervised representation alignment." It employs a sliding window to adaptively add or remove steps based on frame differences. By training only 7.3% of the parameters, it outperforms MakeAnything in alignment and coherence across 21 domains.
- Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
-
GridAR proposes a training-free test-time scaling framework for visual autoregressive (AR) models. By partitioning the canvas into row blocks, generating multiple partial candidates in parallel, and pruning incorrect trajectories early, combined with "layout-specified prompt reconstruction" to provide a global blueprint for subsequent decoding, it outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ using only N=4 while saving 25.6% compute.
- ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
-
The authors unify a broad category of human motion control tasks (trajectory following, 2D→3D lifting, motion completion, cyclic actions, etc.) into linear inverse problems. They propose ProjFlow, a training-free flow matching sampler that utilizes closed-form projections at each denoising step to pull "clean motion estimates" onto the constraint set. By incorporating a "kinematic-aware metric" that encodes skeleton topology, corrections are propagated coherently along the bones, achieving exact satisfaction of hard constraints under zero-shot conditions without inner-loop optimization while maintaining motion naturalness.
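For a linear constraint set, the closed-form projection has a textbook expression. The notation below is assumed for illustration and is not taken from the paper: \(\hat{x}\) is the clean motion estimate, \(Ax = b\) encodes the constraints with \(A\) of full row rank, and \(M \succ 0\) is a metric such as a kinematic-aware one. The projection solves

\[
x^{\star} \;=\; \arg\min_{x} \; (x - \hat{x})^{\top} M \, (x - \hat{x}) \quad \text{s.t.} \quad A x = b,
\]

with solution

\[
x^{\star} \;=\; \hat{x} \;-\; M^{-1} A^{\top} \left( A M^{-1} A^{\top} \right)^{-1} \left( A \hat{x} - b \right).
\]

Substituting gives \(A x^{\star} = b\) exactly, so the hard constraint holds without iterative optimization. With \(M = I\) this reduces to the ordinary Euclidean projection, while a non-identity \(M\) spreads the correction \(A\hat{x} - b\) across coupled coordinates.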
- PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On
-
PROMO is based on the FLUX Flow Matching DiT backbone. Through latent space multimodal condition concatenation, temporal self-reference KV caching, 3D-RoPE grouped conditions, and a fine-tuned VLM style prompt system, it achieves high-fidelity and efficient multi-garment virtual try-on without the need for a traditional reference network. Inference is 2.4× faster than the non-accelerated version, and it outperforms existing VTON and general image editing methods on VITON-HD and DressCode.
- Prompt Yourself: Awakening Textual Semantics in 1D Visual Tokenizers
-
VLTok functionally splits 1D visual token sequences into "visual tokens + text tokens." During training, Self-Prompted Alignment (SPA) distills fine-grained semantics from a pretrained text encoder into text tokens; during inference, the text encoder is discarded to maintain a vision-only workflow. On ImageNet, VLTok reduces rFID by 11.1% and gFID by 18.7% compared to GigaTok with the same parameter count.
- PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
-
To address the issue where T2I models struggle with complex prompts (attribute binding, negation, and compositional reasoning), this paper proposes PromptEnhancer—a model-agnostic rewriting framework that does not modify T2I weights. It initializes a rewriter using CoT data via SFT and then performs policy alignment using GRPO with AlignEvaluator, a specialized reward model scoring 24 fine-grained keypoints. This allows the rewriter to transform short, vague user prompts into structured, detailed descriptions that any frozen T2I can accurately execute, achieving an average improvement of 5.1 points in text-to-image alignment.
- PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
-
PromptLoop uses a multimodal large language model (MLLM), trained via Reinforcement Learning (RL), as a policy that reads intermediate latents step by step and iteratively rewrites the prompt during diffusion sampling. This gives a "prompt refinement only, no weight modification" alignment approach a closed-loop structure isomorphic to directly fine-tuning the diffusion model weights. It improves reward alignment, enhances cross-model generalization, and suppresses reward hacking in a plug-and-play manner, with an inference overhead of only approximately 20%.
- Property-Informed Diffusion-Based Text-to-Microstructure Generation
-
PropDiff-TMG utilizes a self-conditioned 3D diffusion model to directly generate 3D metamaterial microstructures from natural language descriptions (augmented with physical quantities such as Young's modulus, anisotropy, and volume fraction). A dual-alignment mechanism—comprising "contrastive alignment during training + reward-guided alignment during testing"—ensures the generated structures are both semantically consistent and physically feasible. On the Geometries 2000 dataset, it reduces FID from 72.08 to 70.81, improves CLIP score from 0.56 to 0.69, and decreases CD from 0.093 to 0.040.
- Prototype-Guided Concept Erasure in Diffusion Models
-
Addressing the difficulty of thoroughly erasing broad concepts (e.g., violence, pornography) in diffusion models, this paper proposes a training-free erasure method based on concept prototypes. It extracts image prototypes by clustering concept difference directions in the CLIP embedding space, transfers them to the text prototype space via optimization, and selects the best-matching prototype during inference as a negative guidance signal for classifier-free guidance-style concept suppression.
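The inference-time step (pick the best-matching prototype, then use it as a negative signal) can be sketched on toy vectors. Cosine similarity as the matching rule and the common negative-prompt form of classifier-free guidance are assumptions here. The summary does not give the paper's exact formulas, and both function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_prototype(prompt_emb: Sequence[float],
                     prototypes: List[Sequence[float]]) -> int:
    """Index of the prototype most similar to the prompt embedding (assumed rule)."""
    return max(range(len(prototypes)),
               key=lambda i: cosine(prompt_emb, prototypes[i]))


def guided_noise(eps_cond: Sequence[float], eps_neg: Sequence[float],
                 scale: float) -> List[float]:
    """Negative-prompt style guidance: move away from the negative prediction."""
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_neg)]


prototypes = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy text-space prototypes
best = select_prototype([0.9, 0.1], prototypes)
eps = guided_noise([1.0, 2.0], [0.0, 1.0], scale=2.0)
```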
- Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation
-
To address the issues of "unfaithful appearance" and "semantic loss" when directly applying DreamBooth-style fine-tuning to multimodal Autoregressive (AR) models, this paper proposes Proxy-Tuning: a weaker diffusion model first learns the subject from a few reference images and synthesizes batch proxy data to supervise the AR student model. The results show the student surpasses the teacher in subject fidelity, revealing a "weak-to-strong generalization" phenomenon in image generation.
- PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
-
This paper proposes PSDesigner, an automated graphic design system that simulates the creative workflow of human designers. By coordinating three modules—AssetCollector (resource collection), GraphicPlanner (planning tool calls), and ToolExecutor (executing PSD operations)—and training on CreativePSD, the first design dataset in PSD format, the model learns professional design processes and directly generates editable PSD design files.
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
-
To address poor subject consistency and insufficient text compliance in multi-subject personalized image generation, this paper proposes a scalable multi-subject data construction pipeline and Pairwise Subject-Consistency Rewards (PSR). Through a two-stage training process (SFT + RL), the method outperforms existing SOTA methods on the self-constructed PSRBench.
- PureCC: Pure Learning for Text-to-Image Concept Customization
-
The PureCC method is proposed to achieve high-fidelity concept customization while minimizing the impact on the original model's behavior and capacity. This is accomplished by decoupling the learning objective into "target concept implicit guidance" and "original condition prediction," utilizing a dual-branch training pipeline (frozen representation extractor + trainable flow model) and an adaptive guidance scaling factor \(\lambda^{\star}\).
- Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
-
This work decomposes a single RGB image into multiple semantically decoupled RGBA layers end-to-end. Each layer can be independently edited without affecting other content, fundamentally solving semantic drift and geometric misalignment in raster image editing. The decomposition quality significantly exceeds previous recursive methods.
- RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
-
This paper proposes the RAISE framework, which models T2I generation as a requirement-driven adaptive evolutionary process. By decomposing prompts into structured checklists via a requirement analyzer, the framework concurrently evolves candidate populations through multi-action mutations (prompt rewriting, noise resampling, and instruction editing). It then employs tool-augmented visual verification to eliminate candidates that fail to meet requirements in each round. This achieves adaptive inference-time scaling—reaching a SOTA score of 0.94 on GenEval while reducing generated samples by 30-40% and VLM calls by 80% compared to reflection fine-tuning baselines.
- RDF-MIG: A Robust Diffusion Framework for Masked Image Generation to Augment Semantic Segmentation and Change Detection
-
To address the scarcity of annotations for semantic segmentation (SS) and change detection (CD) in remote sensing—compounded by the task-specific nature of existing generation methods, lack of multispectral support, and sensitivity to noisy samples—this paper proposes RDF-MIG. By utilizing Feature Compression Fusion (FCF), multispectral images and masks are integrated into a three-channel tensor for joint diffusion generation. Concurrently, a robust MCRD loss based on correntropy combined with MSE consistency calibration is employed to suppress heavy-tailed noise. This single framework synthesizes aligned image-mask pairs for both SS and CD tasks, enhancing downstream performance.
- Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
-
Addressing the "misalignment" in unified multimodal models where reasoning capability fails to guide image generation, Re-Align utilizes structured In-Context Chain-of-Thought (decomposed into semantic guidance and reference association) to reduce complex interleaved tasks into text-to-image generation. By applying GRPO reinforcement learning with a CLIP similarity-based proxy reward, it achieves state-of-the-art results on OmniContext and DreamOmni2Bench among comparable models.
- RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
-
This paper proposes RealUnify, the first benchmark specifically designed to evaluate the bidirectional synergy between understanding and generation capabilities in unified models. Through 1000 human-annotated instances and a dual evaluation protocol (direct and stepwise), it reveals that while current unified models possess both understanding and generation capabilities, they fail to achieve true capability synergy in end-to-end scenarios.
- ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
-
Existing instruction editing models based on "MLLM Encoder + Diffusion Decoder" often freeze the MLLM, leaving its reasoning capabilities underutilized. ReasonEdit unlocks the MLLM's "Thinking" (translating abstract instructions into concrete executable steps) and "Reflection" (multi-round self-inspection and deciding when to stop) through joint optimization. This forms a thinking–editing–reflection closed loop, delivering consistent performance gains across ImgEdit, GEdit, and Kris benchmarks on both Step1X-Edit and Qwen-Image-Edit backbones.
- RebRL: Reinforcing Discrete Visual Diffusion Models with Rebalanced Timestep Credits
-
Addressing the neglected issue of "severe structural imbalance in timestep credit assignment" when applying GRPO to Discrete Diffusion Models (DDM), this paper derives the mathematical roots of this imbalance from policy gradients. It proposes RebRL, a plug-and-play method that flattens cumulative gradients using two levels of rebalancing factors—timestep-level and token-level. It achieves SOTA on GenEval, improves human preference scores by up to 3.40, and reduces training steps by approximately 40%.
- RecTok: Reconstruction Distillation along Rectified Flow
-
To address the paradox where higher latent dimensions in visual tokenizers lead to poorer generation quality, this paper proposes RecTok. Instead of injecting semantics only into clean latents \(x_0\), it performs Flow Semantic Distillation (FSD) and Masked Reconstruction Alignment Distillation (RAD) along the entire forward trajectory \(\{x_t\}\) of the rectified flow. This breaks the dimension bottleneck, allowing reconstruction, generation, and discriminative performance to improve consistently with dimensionality. It achieves a SOTA gFID of 1.34 on ImageNet 256 without CFG, with convergence 7.75x faster than previous methods.
- Refaçade: Editing Object with Given Reference Texture
-
Refaçade extends "object retexturing" (repainting a target object using local textures from a reference image while preserving its original geometry) from images to videos. The core consists of two decoupling strategies: training a "texture eraser" to reduce the source object into an untextured video containing only geometry, and using "jigsaw permutation" to break the reference image into texture fragments without global structure. This achieves precise and controllable texture transfer on both images and videos, outperforming several strong baselines in both quantitative and human evaluations.
- Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning
-
The MVC-ZigAL framework is proposed to enhance single-view fidelity and cross-view consistency of few-step text-to-multiview diffusion models through multi-view-aware MDP modeling, zigzag self-reflective advantage learning, and Lagrangian dual constrained optimization.
- Refracting Reality: Generating Images with Realistic Transparent Objects
-
Addressing the long-standing failure of text-to-image models in generating realistic refraction for transparent objects, this paper proposes the training-free Snellcaster. It applies Snell's Law to perform "refraction self-warping" at each step of the FLUX generation trajectory. By utilizing an auxiliary panoramic image centered at the transparent object to fill in surfaces not visible to the camera, it strictly constrains refraction and reflection to physical correctness. Masked PSNR is improved from ~12.7 to 16.5, and LPIPS is reduced from 0.47 to 0.24.
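As a rough illustration of the physics behind the refraction warping described above, the vector form of Snell's Law fits in a few lines. This is a minimal sketch, not Snellcaster's implementation: the function name `refract`, its argument layout, and the refractive indices in the usage note are assumptions made for illustration.

```python
import math

def refract(incident, normal, n1, n2):
    """Refract a unit direction `incident` at a surface with unit `normal`.

    `normal` points toward the side the ray comes from (the n1 medium).
    Returns the refracted unit direction as a tuple, or None when total
    internal reflection occurs and no refracted ray exists.
    """
    eta = n1 / n2
    # Cosine of the incidence angle (the normal opposes the incoming ray).
    cos_i = -sum(i * n for i, n in zip(incident, normal))
    # Snell's Law: sin(theta_t) = eta * sin(theta_i), written with squares.
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # total internal reflection
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * i + (eta * cos_i - cos_t) * n
                 for i, n in zip(incident, normal))
```

For example, `refract((0.0, 0.0, -1.0), (0.0, 0.0, 1.0), 1.0, 1.5)` leaves a ray at normal incidence undeviated, while a steep ray leaving glass (`n1=1.5`, `n2=1.0`) returns `None`.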
- Region-Adaptive Sampling for Diffusion Transformers
-
RAS is a training-free sampling strategy that identifies the "fast-update regions" the model is currently focused on and sends only those through the DiT for denoising. "Slow-update regions" directly reuse the noise predictions cached from the previous step. This spatially non-uniform computation allocation achieves 2.36×/2.51× speedups on Stable Diffusion 3 and Lumina-Next-T2I with almost no loss in quality.
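The select-then-reuse idea can be sketched as below. This is a toy illustration under stated assumptions, not the RAS implementation: `update_scores`, `keep_ratio`, and `denoise_fn` are hypothetical names, and a real DiT would process the selected tokens jointly in one forward pass rather than one at a time.

```python
def region_adaptive_step(latents, cached_out, update_scores, denoise_fn,
                         keep_ratio=0.5):
    """Denoise only the highest-scoring regions; reuse the cache elsewhere.

    latents       -- per-region inputs for the current step
    cached_out    -- per-region outputs saved from the previous step
    update_scores -- per-region scores; higher means "fast-update" (assumed metric)
    denoise_fn    -- stand-in for the expensive model call on one region
    """
    n = len(latents)
    k = max(1, int(n * keep_ratio))
    # Indices of the k regions with the largest update scores.
    fast = sorted(range(n), key=lambda i: update_scores[i], reverse=True)[:k]
    out = list(cached_out)  # slow regions keep their cached output
    for i in fast:
        out[i] = denoise_fn(latents[i])
    return out, sorted(fast)
```

With `keep_ratio=0.5`, half of the regions skip the model call at this step, which is where the compute saving would come from.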
- Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
-
This paper discovers that the relational distance between image patch pairs remains invariant after AI editing. It leverages this invariance to construct Rel-Zero, a zero-watermarking framework that achieves robust content authentication against various generative edits without modifying the original image.
- RenderFlow: Single-Step Neural Rendering via Flow Matching
-
The authors propose RenderFlow, which reformulates neural rendering as a single-step conditional flow matching problem from albedo to fully lit images. Utilizing G-buffers as the condition and a pre-trained video DiT as the backbone, the method achieves deterministic rendering over 10 times faster than diffusion-based methods (~0.19s/frame). Optional sparse keyframe guidance further improves physical accuracy, while inverse rendering is supported via a frozen backbone and lightweight adapters.
- ResCa: Residual Caching for Diffusion Transformers Acceleration
-
ResCa is a training-free acceleration framework for Diffusion Transformers. It performs actual denoising on only a single "proxy token" within each trajectory cluster and uses its multi-order residuals to "simulate" the denoising direction of other tokens in the same cluster. This achieves a 5.5× GFLOPs speedup on FLUX with almost no loss in image quality.
- Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering
-
To address the issues of blurred strokes and distorted glyphs in autoregressive (AR) image generation during text rendering, this paper identifies the root cause as the insufficient reconstruction capability of the visual tokenizer. It proposes the Residual Decoder Adapter (RDA): freezing the original tokenizer and AR model, while attaching a Shared-ID Hint codebook and a pixel-level residual decoding branch. This restores text reconstruction quality without changing the token space or retraining any models—boosting Janus-Pro 1B's OCR accuracy from 24.52% to 58.26%.
- Residual Diffusion Bridge Model for Image Restoration
-
This paper re-derives diffusion bridges as stochastic interpolations unified by a "mean-reverting OU process + Doob h-transform." It uses the residual \(\boldsymbol{\pi}=\mathbf{x}_0-\boldsymbol{\mu}\) of paired images to modulate noise injection and removal, ensuring that the model only applies perturbations to degraded regions while protecting clean areas from iterative reconstruction. This approach achieves an average gain of 1.55 dB PSNR across five universal restoration tasks (deraining, low-light enhancement, desnowing, dehazing, and deblurring) while proving existing bridge models to be special cases of this framework.
- Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
-
The authors discover that diffusion bridge models, exemplified by I2SB, undergo "endpoint underfitting"—variance collapse and directional misalignment—when approaching the target endpoint (\(t\to0\)). The root cause is the conflicting noise magnitude trends between the network input and the regression target. They propose NADB, which utilizes "magnitude-aligned stochastic interpolation" to correct variance and a mean network to align endpoints to fix direction, consistently outperforming I2SB across multiple ImageNet restoration and translation tasks.
- Resolving the Identity Crisis in Text-to-Image Generation
-
This paper reveals the "identity crisis" (duplicated faces, identity merging) in text-to-image models during multi-person generation. It proposes the DisCo framework, which utilizes compositional reward functions and Group Relative Policy Optimization (GRPO) to fine-tune flow-matching models. DisCo achieves a 98.6% unique face accuracy, surpassing closed-source models including GPT-Image-1.
- Rethinking Glyph Spatial Information in Font Generation
-
Addressing few-shot font generation (FFG), this paper points out that existing methods ignore "glyph spatial information" by destroying control point coordinates with distorted rendering in data pipelines and implicitly coupling "shape" and "position" in model optimization. It proposes a spatial-preserving rendering scheme, SPR (with an OFL Chinese font dataset and normalized metrics), to enable reversible mapping between raster and vector formats. Additionally, it designs the two-stage GlyphSpatialNet—incorporating shape-position decoupling (SPD), gradient broadcasting (GBM), and stylistic detail enhancement (SDE)—to explicitly model spatial transformations in pixel space, achieving new SOTA results on a unified benchmark without any component or stroke labels.
- Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
-
This paper introduces PRIS: In text-to-image/video inference-time scaling, instead of simply increasing computation for "sampling more images," it utilizes a fine-grained verifier (EFC) to identify "common failure elements" recurring across multiple samples. It then redesigns the prompt for regeneration, allowing the prompt and visual quality to scale together with computation. This achieves a \(+7\%\) gain on GenAI-Bench and a \(+15\%\) gain on VBench 2.0.
- Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
-
Addressing the bottlenecks of "scarcity of image-text pairs" and "inefficient training" in the visual generation components of Unified Multimodal Models (UMM), this paper proposes IOMM, a two-stage framework. It first pre-trains on massive unlabeled images using image semantics as self-conditions via masked reconstruction, followed by hybrid fine-tuning with limited high-quality image-text pairs. Using only ~1050 H800 GPU hours, a 3.6B model trained from scratch achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines like BAGEL-7B and BLIP3-o.
- Reviving ConvNeXt for Efficient Convolutional Diffusion Models
-
This paper proposes FCDM (Fully Convolutional Diffusion Model), adapting the ConvNeXt architecture as a conditional diffusion model backbone. Using only 50% of the FLOPs of DiT-XL, it achieves a competitive FID (2.03) on ImageNet and allows training an XL-sized model on four RTX 4090 GPUs, demonstrating the significantly undervalued efficiency advantages of fully convolutional architectures in generative modeling.
- Reward Sharpness-Aware Fine-Tuning for Diffusion Models
-
This paper diagnoses "reward hacking" (where reward scores increase while visual quality degrades) in Reward-Directed Reinforcement Learning (RDRL) for diffusion models as a form of "adversarial attack." The core issue is that reward models lack robustness in directions where their loss surfaces are steep. To address this, RSA-FT is proposed. Instead of retraining the reward model, it utilizes the gradient of a "smoothed" reward model. This is achieved by simultaneously applying perturbations in the image space (adversarial input perturbation) and parameter space (SAM-style weight perturbation) to find the local worst reward. This dual approach significantly mitigates reward hacking and can be integrated as a plug-and-play module into various RDRL frameworks such as ReFL, DRaFT-K, AlignProp, and DRTune across multiple diffusion backbones.
- RewardFlow: Generate Images by Optimizing What You Reward
-
RewardFlow proposes an inversion-free inference-time framework that integrates multiple differentiable reward signals—including semantic alignment, perceptual fidelity, local positioning, object consistency, and human preference—via multi-reward Langevin dynamics. It achieves SOTA editing fidelity and compositional alignment in image editing and compositional generation tasks.
- Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits
-
This paper introduces the new task of "Portrait Collection Generation (PCG)": given a reference portrait and natural language editing instructions, generate a set of portraits with consistent identity and details but varying poses, perspectives, and compositions. To support the task, the authors construct CHEESE, the first large-scale dataset (~24K collections, 576K triplets, annotated via Large Vision-Language Models + inversion verification), and design the SCheese framework (Fusion IP-Adapter for identity, ConsistencyNet + Decoupled Attention for details), which achieves SOTA performance in Prompt Following (PF) and Detail Preservation (DP).
- Scale Space Diffusion: Integrating Scale Space into the Diffusion Process
-
This paper argues that "diffusion noising" and "scale-space downsampling" are nearly equivalent in terms of information degradation: high-noise states carry no more information than a small image. By treating "progressive downsampling" as the degradation operator in the diffusion process, the authors derive a family of Scale Space Diffusion (SSD) models based on generalized linear degradation. This allows the model to perform high-noise steps at low resolutions and low-noise steps at high resolutions. Accompanied by a Flexi-UNet that activates only relevant network layers, the method reduces training time and FLOPs by more than half on CelebA / ImageNet at the cost of a slight FID increase.
- Scaling Multi-Identity Consistency for Image Customization via Multi-to-Multi Matching Paradigm
-
UMO reformulates "multi-identity customization" as a global assignment problem between multiple reference images and multiple generated faces. By utilizing a plug-and-play Reference Reward Feedback Learning (ReReFL) framework combined with a Multi-Identity Matching Reward (MIMR) based on Hungarian matching, it significantly enhances identity similarity and suppresses identity confusion without retraining the base models.
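The global assignment behind a matching-based reward can be sketched as follows. This is a hedged illustration, not UMO's code: it finds the optimal one-to-one assignment by brute force over permutations, which is a stand-in for Hungarian matching (same optimum, but only practical for a handful of identities). The function name `matching_reward`, the square similarity matrix, and the averaging are assumptions.

```python
from itertools import permutations

def matching_reward(sim):
    """Best one-to-one assignment between references and generated faces.

    sim[i][j] is the similarity between reference identity i and generated
    face j (square matrix assumed). Returns (mean matched similarity,
    assignment), where assignment[i] is the face matched to reference i.
    """
    n = len(sim)
    best_score, best_perm = float("-inf"), None
    # Brute force over all assignments; Hungarian matching reaches the same
    # optimum in polynomial time.
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score / n, list(best_perm)
```

Scoring the whole assignment at once, rather than each reference independently, is what penalizes two references being matched to the same generated face.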
- ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
-
ScenDi decomposes urban scene generation into a "3D coarse generation → 2D refinement" cascaded diffusion process. It first employs a 3D Latent Diffusion Model to generate a 3D Gaussian scene with coarse appearance (ensuring camera controllability) and then utilizes a video diffusion model to refine details and synthesize distant backgrounds on the rendered frames, achieving both high-fidelity quality and precise camera trajectories on Waymo and KITTI-360.
- SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
-
SCIEval is a faithfulness evaluator specifically designed for "scientific images" (line charts, binary trees, molecular formulas, etc., containing precise numerical/attribute data). It decomposes faithfulness into three dimensions: relevance, accuracy, and explainability. By training two scoring sub-modules via CLIP contrastive learning and fine-tuning a lightweight LMM to generate error explanations, it provides a comprehensive assessment. Accompanied by the manually annotated SCIEval-Bench (6,000 samples), SCIEval achieves significantly higher correlation with human judgment compared to 24 competitors, including GPT-4o.
- Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
-
Scone builds upon the unified understanding-generation model BAGEL by transforming the "understanding expert" into a semantic bridge. Through early multimodal alignment and attention masking to filter out irrelevant subjects in reference images, it guides the "generation expert" in an end-to-end manner. This enables the model to accurately "identify then generate" even when a reference image contains multiple candidate subjects, achieving state-of-the-art performance among open-source models on OmniContext.
- Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring
-
Score2Instruct proposes SIG, an automated video quality instruction generation pipeline that requires no human annotation or closed-source APIs. By automatically evaluating 14 quality dimensions and aggregating them into comprehensive quality reasoning text via hierarchical CoT, the authors construct the S2I dataset containing 320K+ instructions. Coupled with a two-stage progressive fine-tuning strategy, several video LMMs simultaneously acquire quality scoring and reasoning capabilities, achieving an average SRCC improvement of 26-31% across five VQA datasets.
- SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
-
SeaCache is proposed as a training-free dynamic caching strategy based on Spectral-Evolution-Aware (SEA) filters. By separating signal and noise components in the frequency domain to measure redundancy between timesteps, it significantly improves the latency-quality trade-off for diffusion model inference.
- Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
-
ViPO transforms the scalar advantage of "full-image scoring" in GRPO into a pixel-level, perception-aware structured advantage. It utilizes a training-free Perceptual Structuring Module (PSM) to extract preference allocation maps from pre-trained visual backbones. These maps are multiplied by the scalar advantage, directing optimization pressure toward regions that human eyes truly care about, thereby outperforming the original GRPO (DanceGRPO) in both image and video generation.
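A toy version of turning a scalar group-relative advantage into a spatial one is sketched below. It is not ViPO's implementation: the weight map here is a hand-written placeholder for the output of the Perceptual Structuring Module, and normalizing the map to mean 1 (so the average advantage is unchanged) is an assumption of this sketch.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def structured_advantage(scalar_adv, weight_map):
    """Spread one scalar advantage over pixels according to a weight map.

    weight_map is a 2D list of non-negative weights with a positive sum.
    It is rescaled to mean 1, so the mean of the returned map equals
    scalar_adv (a choice made for this sketch).
    """
    total = sum(sum(row) for row in weight_map)
    count = sum(len(row) for row in weight_map)
    return [[scalar_adv * w * count / total for w in row] for row in weight_map]
```

For instance, `structured_advantage(2.0, [[1, 3], [0, 0]])` concentrates the whole update on the top row while keeping the average advantage at 2.0.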
- SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
-
SeeThrough3D is proposed to condition the FLUX model via an Occlusion-aware Scene Representation (OSCR) rendered from semi-transparent 3D bounding boxes, achieving precise 3D layout control and occlusion-consistent text-to-image generation.
- SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
-
The SegQuant framework is proposed, which achieves high-fidelity post-training quantization for diffusion models that is generalizable across architectures and compatible with deployment pipelines. This is achieved through semantics-aware segmentation quantization (SegLinear) based on static computation graphs and hardware-native dual-scale polarity-preserving quantization (DualScale), without relying on manual rules or runtime dynamic information.
- Self-Corrected Image Generation with Explainable Latent Rewards
-
This paper proposes the xLARD framework, which performs semantic self-correction in latent space during text-to-image generation via a lightweight residual corrector. It leverages explainable latent reward signals (counting/color/position) to guide generation, achieving a +4.1% improvement on GenEval and +2.97% on DPGBench, while adapting to multiple backbones in a plug-and-play manner.
- Self-Evaluation Unlocks Any-Step Text-to-Image Generation
-
This paper introduces the Self-Evaluating Model (Self-E), enabling a text-to-image model trained from scratch to learn local velocity fields like flow matching while simultaneously using its own current scoring as a "dynamic self-teacher." Without pre-trained teachers or distillation, it achieves a single model supporting any-step inference—producing high-quality images in 2 steps while competing with top-tier flow matching models at 50 steps.
- Semantic Context Matters: Improving Conditioning for Autoregressive Models
-
SCAR replaces the "prefix conditioning" of autoregressive image editing, which relies on lengthy, semantically sparse VQ tokens, with dense semantic prefixes (Compressed Semantic Prefilling) extracted by a frozen visual foundation model and compressed 4× via a learnable module. During decoding, an auxiliary loss is used to align the "internal hidden states" of the source image with the target image semantics (Semantic Alignment Guidance), achieving higher visual quality and instruction consistency for both next-token and next-set AR paradigms while reducing training memory by ~24% and increasing speed by ~1.4×.
- Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions
-
The interaction relationship of "subject → predicate → object" is constructed as a directed acyclic interaction graph. A "Derivative Attention" mechanism is proposed to force predicate semantics to derive from the subject and object semantics to derive from the predicate. A region refinement module then back-injects visual features into graph nodes in real-time. This achieves semantically coherent and spatially reasonable human-object interaction images on HICODet, reaching SOTA in both FID and HOI detection mAP.
- Semantic Scale Space: A Framework for Controllable Image Abstraction
-
This paper reformulates "image abstraction" as a two-dimensional space spanned by smoothing intensity \(t\) and semantic granularity \(g\) (Semantic Scale Space, SSS). By externalizing the decision of "which structures to preserve" from the smoothing process via a controllable boundary detector, it introduces a specific traversal strategy called AGSS (unidirectional donor-gated diffusion + fine-to-coarse scheduling). Under equivalent smoothing levels, AGSS retains significantly more semantic boundaries with lower geometric drift compared to classic baselines, and its downstream NPR stylization results are strongly preferred by users.
- Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
-
SFD decouples semantics and texture into two latent paths within latent diffusion. By using independent noise schedules, semantics are denoised "one step ahead" of texture, serving as a structural blueprint to guide texture refinement. This achieves an FID of 1.04 on ImageNet 256×256 and accelerates training convergence by approximately \(100\times\) compared to DiT.
- SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
-
SenCache replaces empirical heuristics for cache reuse in diffusion models with a first-order estimate of the denoising network's local sensitivity (the Jacobian norm of the output with respect to latent and timestep perturbations). By reusing the cache only when the predicted output variation is below a tolerance \(\varepsilon\), it skips redundant forward passes in a per-sample adaptive manner without retraining or architectural changes, achieving higher visual quality on Wan 2.1 / CogVideoX / LTX-Video for the same computational budget.
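The reuse rule described above amounts to a first-order bound on how much the output can change, roughly \(\|J_x\|\,\|\Delta x\| + \|J_t\|\,|\Delta t| < \varepsilon\). A minimal sketch of that decision follows; the function name and argument layout are hypothetical, and how the Jacobian norms are estimated is left outside the sketch.

```python
def should_reuse_cache(jac_norm_x, jac_norm_t, dx_norm, dt, eps):
    """Decide whether a cached network output can be reused.

    jac_norm_x -- estimated norm of d(output)/d(latent)    (assumed given)
    jac_norm_t -- estimated norm of d(output)/d(timestep)  (assumed given)
    dx_norm    -- norm of the latent change since the cached call
    dt         -- timestep change since the cached call
    eps        -- tolerance on the predicted output variation
    """
    # First-order estimate of the output change since the cached forward pass.
    predicted_change = jac_norm_x * dx_norm + jac_norm_t * abs(dt)
    return predicted_change < eps
```

Because the inputs are measured per sample, two samples at the same timestep can make different reuse decisions, which is the per-sample adaptivity the summary refers to.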
- ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion
-
ShapeAR reformulates "raster-to-vector" as a generative layered stacking task. Using latent space flow-matching diffusion conditioned on both the original image (global context) and a partial composite of previously generated layers (local context), it autoregressively generates sets of non-overlapping RGBA shape layers. This approach recovers "artist-style" complete and reorderable closed shapes, outperforming previous SOTA on multiple vectorization metrics.
- ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
-
ShowTable introduces the new task of "Creative Table Visualization" (converting data tables into infographics) and designs a progressive self-correction pipeline coordinating MLLM (reasoning + reflection) with Diffusion models (generation + refinement). Through a specifically trained rewriting module and a refinement module optimized via RL, it significantly enhances the visualization quality of all baseline models on the self-constructed TableVisBench benchmark.
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
-
ShowUI-π transfers "Flow-matching VLA," typically used for dexterous manipulation in robotics, to the GUI domain. By employing a 450M lightweight action expert, it unifies clicking and dragging into continuous coordinate trajectories. This enables agents to perform high-degree-of-freedom dragging tasks requiring real-time adjustment, such as rotation, drawing, and solving slider captchas. The authors also release the ScreenDrag dataset and online/offline evaluation benchmarks.
- SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images
-
This paper proposes SimLBR, which uses Latent-space Blending Regularization (LBR) to mix sparse fake image information into real image embeddings within the DINOv3 latent space. This forces the detector to learn a compact decision boundary around the real image distribution, achieving strong generalization to unseen generators. It reaches an average accuracy of 94.54% on GenImage and improves accuracy by 25% and recall by 70% over AIDE on the difficult Chameleon test set.
- SimplePoster: A Simple Baseline for Product Poster Generation
-
Aiming at the two primary requirements of e-commerce product poster generation (no distortion of the subject and precise placement of multi-line text), SimplePoster removes the stacked ControlNet / OCR encoders used in existing methods. It relies solely on "full-parameter fine-tuning of FLUX-Fill" to eliminate subject extension and on "zero-cost Character Position Encoding" to achieve layout-controllable text. The subject preservation rate rises from PosterMaker's 85.3% to 98.7%, and text accuracy also leads across the board.
- SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation
-
This paper analyzes the bottleneck of Speculative Jacobi Decoding (SJD) in text-to-image generation, specifically the severely skewed distribution of its acceptance lengths. It introduces the SJD-PAC framework, which incorporates two techniques: Proactive Drafting (PD) and Adaptive Continuation (AC). Under strictly lossless conditions, SJD-PAC achieves a 3.8× inference speedup, significantly outperforming the ~2× speedup of original SJD.
- SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing
-
SketchAssist unifies "sketch editing via text instructions" and "local redrawing via hand-drawn lines" into a single DiT framework. By utilizing a controllable data synthesis pipeline to generate structure-aligned paired training samples and employing a 3-channel unified input representation with Task-routed MoE (T-MoE), the model achieves seamless switching between editing modes and attains SOTA performance in both tasks.
- SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
-
SketchDeco is proposed as a training-free sketch colorization method. It employs a global-local two-stage strategy using region masks and color palettes as precise control signals. By utilizing diffusion model inversion and self-attention injection, it achieves precise regional coloring and harmonious global transitions in latent space, completing in 15-20 steps on consumer-grade GPUs.
- SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior-Guided Multimodal LLMs
-
SketchRevive introduces the new task of "Fine-Grained Pixel-to-Vector Sketch Completion" using a two-stage framework: a diffusion model first performs structurally consistent completion at the pixel level, followed by an MLLM for structure-aware refinement and vectorization. By injecting intermediate diffusion features into the MLLM visual stream, the framework significantly outperforms naive cascades of ControlNeXt with GPT-5 or Gemini across metrics like FID, IoU, and SRR.
- SkyReels-Text: Fine-Grained Font-Controllable Text Editing for Poster Design
-
SkyReels-Text models "text replacement" as a region-level editing task. By using a user-cropped glyph patch as an explicit visual condition injected through a dual-stream VAE, it achieves zero-shot font transfer. This approach accurately replaces text content while precisely replicating arbitrary fonts (including handwriting and artistic styles), achieving SOTA in text fidelity and font consistency across multiple benchmarks.
- SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control
-
SliderEdit introduces "sliders" for each sub-instruction in instruction-based image editing models (e.g., FLUX-Kontext, Qwen-Image-Edit). By utilizing a shared set of Low-Rank Adaptors combined with a partial prompt steering loss, it allows users to adjust the intensity of each edit continuously and independently, from zero application to exaggerated levels, without requiring separate training for each attribute.
- Smoothing the Score Function to Enhance Generalization in Diffusion Models
-
This paper theoretically demonstrates that memorization in diffusion models (where generated samples verbatim copy training samples) originates from the empirical score function being a sum of Gaussian components with "sharp softmax weighting," causing a single training point to dominate sampling and lead to collapse. Accordingly, two methods for smoothing weights—Noise Unconditioning and Temperature Smoothing—are proposed, significantly enhancing generalization and mitigating memorization with minimal loss in generation quality.
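The "sharp softmax weighting" can be seen in a one-dimensional toy example. The sketch below computes the score of a Gaussian-smoothed empirical distribution, \(\sum_i w_i(x)\,(x_i - x)/\sigma^2\) with softmax weights \(w_i\), and adds a temperature that flattens those weights. The exact form of the paper's Temperature Smoothing is not given in the summary, so dividing the logits by `temperature` is an assumption of this sketch.

```python
import math

def empirical_score(x, train_points, sigma, temperature=1.0):
    """Score of a Gaussian-smoothed empirical distribution (1D toy case).

    Returns (score, weights). With temperature=1 this is the exact score of
    an equal-weight Gaussian mixture centred on the training points; larger
    temperatures flatten the softmax weights (assumed smoothing form).
    """
    logits = [-(x - xi) ** 2 / (2 * sigma ** 2 * temperature)
              for xi in train_points]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    score = sum(w * (xi - x) for w, xi in zip(weights, train_points)) / sigma ** 2
    return score, weights
```

At small `sigma` and `temperature=1.0`, nearly all the weight sits on the nearest training point, so the score pulls samples straight onto it; raising the temperature spreads the weight across neighbors.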
- SOLACE: Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
-
The denoising self-confidence of a T2I model (its precision in recovering injected noise) is utilized as an intrinsic reward for post-training, substituting external reward models. This approach yields consistent improvements in compositional generation, text rendering, and image-text alignment, while complementing external rewards to mitigate reward hacking.
- SparVAR: Exploring Sparsity in Visual Autoregressive Modeling for Training-Free Acceleration
-
This paper performs a systematic analysis of attention activation patterns in VAR models, revealing three major sparsity characteristics (attention sinks, cross-scale similarity, and spatial locality). It proposes the SparVAR training-free acceleration framework, which incorporates two plug-and-play modules: Cross-Scale Self-Similar Sparse Attention (CS⁴A) and Cross-Scale Local Sparse Attention (CSLA). With an 8B model, SparVAR generates 1024×1024 images in roughly one second (1.57× acceleration) with almost no loss in high-frequency details.
- Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
-
This paper proposes Spatial-SSRL, a self-supervised reinforcement learning paradigm. By automatically constructing five pretext tasks (patch reordering, flip identification, crop inpainting, depth ordering, and relative 3D position prediction) from standard RGB/RGB-D images, it utilizes GRPO to optimize the spatial understanding capabilities of LVLMs. This approach achieves an average improvement of 3.89%-4.63% across seven spatial benchmarks without requiring human annotations or external tools.
- SpatialDiff: 3D-Aware Object Movement via Implicit Spatial Modeling
-
SpatialDiff injects implicit spatial priors from a single image into a Diffusion Transformer via a 3D geometric encoder, supplemented by latent space depth supervision, without performing explicit 3D reconstruction. This allows instruction-driven image editing to "properly relocate" objects in complex scenes involving occlusions and multiple depth layers.
- SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
-
SpatialReward is a "verifiable" spatial reward model for text-to-image (T2I) generation. It first decomposes free-form text into structured constraints, then uses expert models such as object detection and OCR to objectively verify the generated images. Finally, it utilizes a Vision-Language Model (VLM) for Chain-of-Thought (CoT) reasoning based on verified facts to provide spatial reward scores. Integration with Flow-GRPO significantly enhances the spatial consistency of SD3.5-M and FLUX (SpatRelBench overall improved from 0.23 to 0.42 and 0.28 to 0.46, respectively).
- Spatiotemporal Pyramid Flow Matching for Climate Emulation
-
This paper extends "coarse-to-fine" pyramid flow matching to both the spatial and temporal dimensions, yielding Spatiotemporal Pyramid Flow (SPF). Using a DiT network to sample decadal/yearly/monthly climate fields in parallel in pixel space, it runs 15–28× faster than autoregressive climate emulators while attaining superior CRPS/RMSE on ClimateBench.
- SPDMark: Selective Parameter Displacement for Robust Video Watermarking
-
SPDMark proposes an in-model video watermarking framework based on Selective Parameter Displacement (SPD). By learning a dictionary of low-rank base shifts in the decoder and combining them based on a watermark key, it achieves per-frame embedding, imperceptibility, high robustness, and low computational overhead, while supporting temporal tampering detection and localization.
- SpeeDiff: Scalable Pixel-Anchored End-to-End Latent Diffusion Model
-
SpeeDiff dismantles the two-stage pipeline in Latent Diffusion Models (LDM)—where the VAE is trained first and then frozen—by enabling joint training of the VAE and the diffusion model from scratch without stop-gradients. The key innovation is a Tweedie Pixel Reconstruction (TPR) loss that "anchors" diffusion gradients back to the pixel space, preventing latent collapse. It achieves a gFID of 1.50 (without guidance) on ImageNet 256×256, with training speeds 140× faster than Vanilla SiT and 61× faster than REPA.
- Spherical Leech Quantization for Visual Tokenization and Generation
-
This paper unifies non-parametric quantization (NPQ) methods like LFQ, FSQ, and BSQ into the language of "lattice codes." It identifies that entropy regularization essentially performs lattice relocation. Consequently, based on the "densest sphere packing" principle, the authors derive \(\Lambda_{24}\)-SQ using the 24-dimensional Leech lattice. This pushes the visual codebook size to approximately 200,000, enabling tokenizer training without any entropy or commitment regularization. It also marks the first time a discrete visual autoregressive model achieves a near-oracle 1.82 gFID on ImageNet-1k using a ~200k codebook.
- Spk2VidNet: A Hierarchical Recurrent Architecture for High-Fidelity Video Reconstruction from Long Spike-Camera Streams
-
Addressing the limitations of Spike Camera Super-Resolution (SCSR) in handling fixed short sequences and spike signal fluctuations, Spk2VidNet employs "dual-layer recurrent propagation with expanding temporal receptive fields + multi-frame consistency alignment + content-aware modulation fusion + segmented training state transfer" to reconstruct high-resolution image sequences from arbitrarily long spike streams. It sets a new SOTA on synthetic and real data at higher speed (REDS-LSSR ×4 PSNR 29.92dB, 43ms inference).
- SplitFlux: Learning to Decouple Content and Style from a Single Image
-
This work systematically analyzes the functional division of blocks within the FLUX model, discovering that single stream blocks are essential for image generation, with the early stages controlling content and the late stages controlling style. Based on this, SplitFlux fine-tunes these blocks using LoRA for content-style decoupling from a single image. By incorporating Rank-Constrained Adaptation (RCA) to preserve identity and Visual-Gated LoRA (VGRA) to enable re-contextualization, this method significantly outperforms SDXL and FLUX baselines in content fidelity.
- SpotEdit: Selective Region Editing in Diffusion Transformers
-
SpotEdit is a training-free DiT image editing framework that exploits the phenomenon where "non-edited regions converge rapidly in the early stages of denoising." It utilizes perceptual similarity to automatically identify stable tokens, removes them from DiT computation to reuse conditional image features, and combines this with a time-annealed KV fusion mechanism to maintain context. It achieves a 1.7×–1.95× speedup on FLUX.1-Kontext with almost no loss in editing quality.
- SPREAD: Spatial-Physical REasoning via geometry Aware Diffusion
-
SPREAD formulates "how to place objects in a physically plausible manner" as a guided diffusion framework: it uses a Graph Transformer to jointly encode spatial and physical relationship graphs, "observes" collisions and interpenetrations between noisy meshes via a Geometry Aware Perceiver during each denoising step, and employs a three-way differentiable guidance (Collision / Gravity / Support) during inference to push objects into physically consistent poses. This generates 3D indoor scenes that are almost entirely stable in Isaac Sim, making them directly applicable to Embodied AI.
- SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training
-
SRA 2 directly utilizes the existing SD-VAE encoding features from the first stage of latent diffusion as supervision signals. By using a lightweight MLP to project intermediate SiT features for alignment, it accelerates diffusion Transformer training convergence by up to \(7\times\) with only a 4% increase in GFLOPs, without introducing external representation encoders or maintaining a dual-model teacher.
- Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
-
Given an object mesh and its trajectory, this paper employs a flow matching framework to generate full-body motions for two individuals collaborating on object transport. Through three modules—affordance-guided contact strategy, adversarial interaction prior, and sampling-based stability simulation—the generated motions simultaneously satisfy intentional correctness (correct grasping), natural poses, and physical stability (minimal floating and penetration). On Core4D, it significantly outperforms existing HOI baselines in contact accuracy, penetration depth, and distributional fidelity.
- STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution
-
STCDiT performs real-world video super-resolution (VSR) based on a pre-trained video diffusion model (Wan2.1). It addresses VAE reconstruction distortions under complex camera movements using "Motion-Aware VAE Segmented Reconstruction" and injects well-preserved spatial structure information from the first-frame latent of each segment into the generation process via "Anchor-Frame Guidance." By adding only approximately 7% of the trainable parameters of a standard LoRA, it surpasses SOTAs such as SeedVR and STAR in structural fidelity and temporal consistency.
- Steering Where to Diffuse: Generative Modeling of Phenotypic Response Simulation with Steered Diffusion Bridge
-
SimuSDB models the task of "predicting the morphological change of an unperturbed cell image under specific chemical/genetic perturbations" as a stochastic diffusion bridge from the source cell distribution to the perturbed distribution. By using a conditional Brownian bridge, trajectories are allowed to diverge around a deterministic backbone to capture phenotypic diversity. The constraint that "generative results must match a specific perturbed phenotype" is reformulated as a stochastic optimal control problem to steer the drift term. SimuSDB outperforms diffusion, flow matching, and GAN baselines in FID/KID across BBBC021, RxRx1, and JUMP benchmarks.
- Stepwise-Flow-GRPO: Assigning Stepwise Credit to Denoising Steps in Flow-Matching Models
-
To address the flaw in Flow-GRPO that distributes the "same final image advantage" uniformly across all denoising steps, this paper uses the Tweedie formula to estimate intermediate rewards. It employs "adjacent step reward gain" as the stepwise advantage for GRPO and incorporates a DDIM-style SDE to improve sampling quality, achieving higher sample efficiency and faster convergence in text-to-image RL.
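A minimal sketch of the "adjacent step reward gain" idea, assuming per-step rewards have already been computed on the Tweedie estimates. The per-step group normalization follows the usual GRPO convention and is an assumption, as are the function name and the toy numbers; the paper's exact normalization may differ.

```python
import numpy as np

def stepwise_advantages(step_rewards, eps=1e-8):
    """Turn intermediate rewards into per-step advantages.

    step_rewards: array of shape (group_size, num_steps + 1) holding the
    reward of the clean-image estimate at each point of every trajectory.
    Returns an array of shape (group_size, num_steps).
    """
    gains = np.diff(step_rewards, axis=1)          # reward gain of each step
    mean = gains.mean(axis=0, keepdims=True)       # group statistics per step
    std = gains.std(axis=0, keepdims=True)
    return (gains - mean) / (std + eps)

# Three trajectories, two denoising steps (illustrative values only).
rewards = np.array([
    [0.1, 0.3, 0.6],
    [0.1, 0.2, 0.2],
    [0.1, 0.4, 0.5],
])
adv = stepwise_advantages(rewards)
```

Each step receives its own advantage, so a trajectory that improves early but stalls late is credited and penalized at the corresponding steps instead of sharing one final-image advantage.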
- StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
-
A two-stage autoregressive adaptation framework (autoregressive distillation + adversarial refinement) is proposed to transform bidirectional human video diffusion models into real-time streaming generators. Reference Sink, RAPR positional re-encoding, and a consistency-aware discriminator keep long videos stable, yielding the first real-time full-body digital human that supports both talking and listening interactions.
- StreamDiT: Real-Time Streaming Text-to-Video Generation
-
StreamDiT proposes a comprehensive streaming video generation solution (including training, modeling, and distillation). By introducing a sliding buffer with progressive denoising in Flow Matching and a mixed partition training strategy, combined with a time-variant DiT architecture with window attention and a customized multi-step distillation method, a 4B parameter model achieves real-time streaming video generation at 512p@16FPS on a single GPU.
- Streaming Diffusion Model for Fast Infrared and Visible Video Fusion
-
SDMFusion distills a pre-trained diffusion model into a "one-step sampling + streaming memory" framework for infrared-visible video fusion. It utilizes single-step residual sampling for real-time speed, gated temporal aggregation adapters with optical flow-aligned memory for inter-frame coherence, and a temporal consistency loss to suppress flickering and ghosting, achieving SOTA quality and the fastest inference across four benchmarks.
- Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling
-
Addressing the persistent "style leakage + semantic drift" issue in diffusion-based editing models for style transfer, this work constructs a preference dataset, StyleReward-Dataset, containing 300,000 adversarial image pairs. A multimodal reward model, StyleScore, is trained to simultaneously evaluate style consistency and content fidelity. By employing a two-stage "SFT Domain Adaptation + GRPO Preference Optimization" pipeline, FLUX.1[Kontext] is fine-tuned to achieve SOTA performance. It leads in both style fidelity and content preservation on ImgEdit/AnyEdit benchmarks and was selected as the top choice by 87.5% of participants in a user study.
- StyleDoctor: Towards Specialist Reward Model for Style-centric Generation Tasks
-
StyleDoctor replaces general human preference reward models with a "specialist style reward model" based on a multimodal large language model (Qwen2.5-VL-3B). By constructing SPRData, a style preference dataset containing 400,000 "quadruplets," and employing a three-stage training process, the model learns to perceive both image style and text style semantics. It ultimately serves as a reward signal for the reinforcement fine-tuning of diffusion models, significantly enhancing style consistency in style-centric generation and transfer tasks.
- StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References
-
StyleGallery is a training-free semantic-aware style transfer framework. It first performs unsupervised semantic clustering on content images using intermediate diffusion features, then adaptively matches content regions with the most relevant regions from arbitrary style references across statistical, semantic, and geometric dimensions. Finally, it employs regional style loss to guide diffusion sampling, achieving interpretable and customizable fine-grained style transfer without requiring external masks.
- StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
-
StyleTextGen models "generating scene text based on the style of a reference image" as a DiT diffusion inpainting task. It utilizes a dual-branch style encoder (a text branch for extracting glyph textures and a vision branch for capturing global tones) to extract style embeddings decoupled from the background. By incorporating a style consistency loss calculated exclusively within text regions and an inference strategy that injects reference KV only during the first 10 steps, it achieves SOTA results in style similarity and character accuracy for both monolingual and cross-lingual scene text generation in Chinese and English.
- Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
-
CompGen defines "compositional difficulty" through the structural complexity of scene graphs, utilizes adaptive MCMC to sample scene graphs within specified difficulty intervals to construct training prompts, and integrates "easy-to-hard" curriculum weights into the rewards of Group Relative Policy Optimization (GRPO). Without requiring any ground-truth images, this approach improves the compositional generation capabilities of diffusion and autoregressive T2I models by an average of 7–12 points.
- SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
-
SynthRGB-T reformulates infrared \(\leftrightarrow\) visible image translation as "vision-language guided denoising diffusion." It utilizes foundation models to automatically extract foreground semantic priors and injects decoupled foreground, content, and text conditions into different resolution layers of the U-Net. This enables a single model to perform bidirectional translation and generate diverse results based on text prompts, achieving SOTA on multiple real-world benchmarks for both I2V and V2I directions.
- TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts
-
Addressing the severe task interference problem in unified image generation and editing models, this paper proposes the TAG-MoE framework. By using a hierarchical task semantic annotation scheme and predictive alignment regularization, high-level task intent is injected into local MoE routing decisions. This evolves the gating network from a task-agnostic executor into a semantic-aware scheduling center, achieving state-of-the-art (SOTA) performance among open-source models across five benchmarks, including ICE-Bench, EmuEdit, GEdit, and DreamBench++.
- Taming Generative Diffusion Model for Task-Oriented Infrared Imaging
-
Infrared image restoration is reformulated as "one-step diffusion" by using a lightweight predictor to align degraded inputs to the optimal timestep \(\hat t\) on the diffusion trajectory. Combined with wave-domain spectral regularization to preserve thermal radiation characteristics and task-aware low-rank adaptation that switches between downstream tasks (detection/segmentation) via optimizing a few-hundred-dimensional prompt, the method outperforms existing approaches in restoration quality, semantic preservation, and efficiency.
- Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
-
The D2-Align framework is proposed to correct reward signals by learning directional correction vectors in the reward model's embedding space. This addresses the Preference Mode Collapse (PMC) issue in RLHF alignment for diffusion models—where over-optimization of rewards leads to a severe decline in generative diversity. Additionally, the DivGenBench benchmark is introduced to quantitatively evaluate generative diversity.
- Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models
-
This paper reveals that the \(\beta\)-VAE tokenizer in Latent Diffusion Models (LDMs) suffers from an overly compact latent space due to variance collapse, making it highly sensitive to diffusion sampling perturbations. It proposes Variance Expansion (VE) Loss to adaptively learn a robust latent space variance through an adversarial balance between reconstruction and variance expansion, consistently improving generation quality (FID 1.18) across multiple diffusion architectures.
- Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
-
WorldForge proposes a completely training-free inference-time guidance framework that transforms pre-trained video diffusion models into 3D/4D generation tools with precise camera trajectory control through three synergistic components: Intra-step Recursive Refinement (IRR), Flow-guided Latent Fusion (FLF), and Dual-path Self-correcting Guidance (DSG), surpassing both training-based and inference-based baselines in trajectory accuracy and perceptual quality.
- TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration
-
The TAP framework is proposed to adaptively select the optimal predictor (from the Taylor expansion family) for each token at every step through a first-layer probe. It achieves training-free diffusion acceleration, reaching a \(6.24\times\) speedup on FLUX.1-dev with no perceptible quality loss.
- TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
-
The authors propose TC-Padé, a feature residual prediction framework based on Padé rational function approximation. Through adaptive coefficient adjustment and stage-aware strategies, it achieves trajectory-consistent acceleration (2.88× for FLUX.1-dev, 1.72× for Wan2.1) in low-step (20-30 steps) diffusion sampling scenarios, significantly outperforming existing methods based on Taylor expansion.
- Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation
-
Addressing the training collapse of MeanFlow in one-step generation when increasing the proportion of "trajectory samples," this paper identifies the root cause as a severe imbalance in gradient variance across different temporal scales. It proposes two modifications with zero additional inference overhead: "Temporal Equilibrium Weighting" and "Dynamic Boundary Scheduling," pushing the 1-NFE FID on ImageNet 256×256 to 2.62, outperforming all diffusion-based or flow-based one-step methods.
- Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
-
Instead of modifying model weights or perturbing noise/latents, this method optimizes only the "null-text embedding" within Classifier-Free Guidance (CFG). This allows the diffusion model to align with target rewards during the inference phase. Since the text embedding space is a structured semantic manifold, this approach achieves SOTA rewards without "cheating" via non-semantic noise (reward hacking).
- Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling
-
This paper proposes Composer, a plug-and-play meta-generator framework that dynamically generates low-rank parameter updates based on each input condition during inference and injects them into pre-trained model weights. With extremely low computational overhead (+0.2% time, +3.6% memory), it achieves instance-specific adaptive high-quality image generation, significantly enhancing performance across scenarios such as class-conditional generation, text-to-image, post-training quantization, and test-time scaling.
- TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
-
TextPecker is proposed as a plug-and-play structural anomaly-aware RL strategy. By constructing a character-level structural anomaly annotation dataset to train a structural-aware recognizer, it replaces noisy OCR reward signals. This approach jointly optimizes semantic alignment and structural fidelity, significantly enhancing visual text rendering quality across multiple text-to-image models (FLUX, SD3.5, Qwen-Image).
- Texvent: Asynchronous Event Data Simulation via Text Prompt
-
Texvent directly generates asynchronous event camera data from text prompts. It first utilizes a Multi-modal Large Language Model (MLLM, e.g., Cosmos) to render text into video, then converts the video into an event stream via a novel training-free physical simulator. By employing "Luminance-aware Interpolation + Balanced Log-intensity Contrast + Luminance Caching," it achieves significantly higher fidelity than cascaded baselines while maintaining near-optimal generation speed.
- The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
-
ImageCritic formulates "fixing detail inconsistencies in customized generated images" as a reference-guided post-editing task. Specifically, it constructs 10k reference-degraded-target triplets using VLM screening and Flux-Fill active degradation. By introducing an attentive alignment loss and a detail encoder on Flux Kontext, the model precisely locates and aligns fine-grained details like text and logos. An Agent chain is further utilized to achieve automated multi-turn correction.
- The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
-
Addressing the "attention collapse" problem in training-free non-rigid image editing, this paper proposes SynPS: it first quantifies the editing degree per step via the ratio of image similarity to text similarity, and then dynamically scales the relative distance of RoPE in attention sharing. This adaptively balances "preserving source structure" and "following target semantics," achieving a new SOTA in MLLM scores on PIE-Bench and self-built benchmarks.
- The Drift Kernel: Why Diffusion Models Change Even When Told Not To
-
When a diffusion model is instructed to "change nothing," it still subtly modifies the input. This paper quantifies this "no-op drift" as a Drift Kernel \(K_M(\sigma)\approx k_M\sigma^2+c_M\), which grows quadratically with noise strength \(\sigma\). Based on the first-order Taylor expansion of the decoder Jacobian and verified on 120,000 image pairs, the study proves that drift is a structural property of the decoder rather than an issue with prompt phrasing.
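The quadratic form \(K_M(\sigma)\approx k_M\sigma^2+c_M\) is linear in \(\sigma^2\), so the two coefficients can be estimated by ordinary least squares. The sketch below shows one way to do this; the function name is hypothetical and the measurements are synthetic placeholders, not data from the paper.

```python
import numpy as np

def fit_drift_kernel(sigmas, drifts):
    """Least-squares fit of drift ~ k * sigma**2 + c; returns (k, c)."""
    sigmas = np.asarray(sigmas, dtype=float)
    drifts = np.asarray(drifts, dtype=float)
    # Design matrix with columns [sigma^2, 1] so the model is linear in (k, c).
    design = np.stack([sigmas**2, np.ones_like(sigmas)], axis=1)
    (k, c), *_ = np.linalg.lstsq(design, drifts, rcond=None)
    return k, c

# Synthetic drift measurements generated from known coefficients.
sigmas = np.linspace(0.1, 1.0, 10)
drifts = 0.8 * sigmas**2 + 0.05
k, c = fit_drift_kernel(sigmas, drifts)
```

On real measurements, the residual of this fit would indicate how well the quadratic law holds for a given model.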
- The Universal Normal Embedding
-
This paper proposes the Universal Normal Embedding (UNE) hypothesis: generative models (diffusion models) and visual encoders (CLIP, DINO) share an underlying geometric structure in their latent spaces that is approximately Gaussian. Both can be viewed as noisy linear projections of this shared space. The hypothesis is validated through the NoiseZoo dataset and extensive experiments, demonstrating the capability to perform direct linear semantic editing within the DDIM inversion noise space.
- TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
-
TherA employs a "thermal-aware" Vision-Language Model (TherA-VLM) to infer structured thermal semantic embeddings—covering scene, object, material, and heat states—from RGB images. These embeddings are injected into a latent diffusion model for conditional TIR generation. This approach upgrades RGB-to-TIR translation from simple "style transfer" to "thermally-consistent" controllable synthesis, outperforming SOTA in zero-shot translation by up to 33%.
- ThinkGen: Generalized Thinking for Visual Generation
-
ThinkGen explicitly integrates the MLLM `<think>` Chain-of-Thought (CoT) into image generation. It utilizes a decoupled "MLLM thinking + DiT rendering" architecture and SepGRPO training that alternately reinforces the MLLM and DiT. This enables the model to automatically trigger CoT reasoning across various scenarios such as text-to-image, text rendering, image editing, and reasoning-based generation, achieving SOTA performance on benchmarks including GenEval (0.89), CVTG (0.84), and ImgEdit (4.21).
- Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
-
This paper proposes TWIG (Thinking-while-Generating), the first text-to-image framework where textual reasoning "intervenes during generation." By inserting textual thoughts regionally during autoregressive image generation, it provides local guidance for the next segment and performs scoring/error correction for segments just completed. Evaluated via zero-shot, SFT, and RL routes, it improves color binding on T2I-CompBench from 63.6 to 82.5 using Janus-Pro-7B.
- TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models
-
This paper proposes TINA (Text-free INversion Attack), which identifies precise initial noise by optimizing DDIM inversion under the null-text condition. This bypasses all text-based concept erasure defenses and demonstrates that current erasure methods only sever text-to-image mappings without truly deleting internal visual knowledge from the model.
- TokenLight: Precise Lighting Control in Images using Attribute Tokens
-
This paper proposes TokenLight, which formulates image relighting as an end-to-end generation task conditioned on attribute tokens (intensity, color, ambient light, diffuse level, and 3D light position). It achieves precise, continuous, and interpretable lighting control within a Diffusion Transformer framework.
- Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity
-
Addressing the issue where T2I generated images are "too vivid to look like real photos," this paper proposes the Color Fidelity Dataset (CFD, 1.3M images), the Color Fidelity Metric (CFM, based on Qwen2-VL + softrank loss), and Color Fidelity Refinement (CFR, a training-free spatio-temporal adaptive guidance modulation), forming an integrated evaluation-improvement framework.
- Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
-
The authors use frequency perturbation experiments to dissect the high-dimensional trade-off in latent diffusion—where better reconstruction often leads to worse generation. The root cause is identified as the decoder's extreme reliance on high-frequency latent components, while the encoder tends to discard them. Based on this, FreqWarm is proposed: a curriculum that feeds low-pass filtered images to the diffusion model during early training for high-frequency "warm-up" before switching back to full-frequency fine-tuning. This approach reduces the gFID of several high-dimensional VAEs by 4–14 points without modifying any autoencoders.
- Toward Early Quality Assessment of Text-to-Image Diffusion Models
-
This paper proposes Probe-Select, a lightweight probe attached to the intermediate activations of diffusion denoisers. By running only 20% of the generation trajectory, it predicts the final quality score of an image, allowing for the early pruning of unpromising random seeds. This approach reduces the sampling overhead of the "generate-then-select" pipeline by approximately 64% while simultaneously improving the quality of the retained images.
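The source of the saving can be checked with back-of-the-envelope arithmetic. The sketch below assumes every seed runs the probed fraction of the trajectory and only a kept fraction finishes. The keep ratio of 20% is a hypothetical value chosen for illustration (the summary does not state it), and probe overhead is ignored.

```python
def relative_cost(probe_fraction, keep_ratio):
    """Cost of probe-then-prune relative to fully sampling every seed.

    All seeds pay probe_fraction of a full trajectory; only keep_ratio of
    them pay for the remaining (1 - probe_fraction).
    """
    return probe_fraction + (1.0 - probe_fraction) * keep_ratio

# 20% of the trajectory is probed (from the summary); 20% kept is assumed.
cost = relative_cost(probe_fraction=0.2, keep_ratio=0.2)
savings = 1.0 - cost
```

Under these assumed settings the relative cost is 0.2 + 0.8 × 0.2 = 0.36, a saving of 64%, which is in line with the reported figure; the paper's actual configuration may differ.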
- Towards Fine-Grained Attribution: Instance-Aware Preference Optimization for Aligning Diffusion Models
-
To address the spatially sparse reward problem in DPO-based diffusion model alignment, where a single preference label is assigned to an entire image, IAPO utilizes a VLM and a detector to automatically annotate an instance-level preference dataset. By employing an instance alignment loss with a dynamically reweighted mask, it refines credit assignment from image-level granularity to individual object granularity, achieving SOTA performance on multiple benchmarks with a training efficiency 3.27× higher than InPO.
- Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework
-
MagicBokeh unifies "Super-resolution for high digital zoom" and "Bokeh rendering" within a single-step diffusion framework. It resolves optimization conflicts between the two tasks through an alternating training strategy and focus-aware masked attention, while employing a degradation-aware depth module to estimate reliable disparity maps from low-quality inputs. The model achieves more realistic bokeh than "SR followed by Bokeh" two-stage pipelines on real low-resolution phone photos with 0.1s-level speed.
- Towards Robust Sequential Decomposition for Complex Image Editing
-
Addressing complex image editing where "multiple interdependent operations are packed into a single instruction," this work investigates "sequential decomposition" within an in-context editing framework. High-quality editing chains with decomposition labels are synthesized via Blender to fine-tune BAGEL. A Context-Guided Sequential Editing (CGSE) paradigm is designed to regulate the influence of "historical editing results," ensuring that performance improves with more decomposition steps and enabling successful sim-to-real transfer to real images through co-training.
- Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
-
STALL is proposed as a training-free, zero-shot generated video detector. By jointly modeling per-frame spatial likelihoods and inter-frame temporal likelihoods in a whitened embedding space, it achieves robust detection across various generative models using only real-video calibration.
- Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
-
To address the slow inference of Diffusion Transformers (DiT), this paper proposes RALU (Region-Adaptive Latent Upsampling), a training-free method. It performs initial denoising in a low-resolution latent space (1/4 tokens), applies early upsampling only to edge-prone regions, and uses NT-Matching to realign deviated noise and timestep distributions. It achieves a 7.0× speedup on FLUX, reaching up to 15.9× when combined with temporal acceleration and distillation, while maintaining nearly original image quality.
- Training-free Motion Factorization for Compositional Video Generation
-
A motion factorization framework is proposed to decompose the motion of multiple instances in a scene into three categories: static, rigid, and non-rigid. It addresses semantic ambiguity in prompts through Structured Motion Graph Reasoning (SMR) and regulates the generation of these three motion types during the diffusion process via Decoupled Motion Guidance (DMG). Without additional training, it significantly improves motion diversity and fidelity on VideoCrafter-v2.0 and CogVideoX-2B.
- Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
-
To alleviate the high-resolution (HR) computational burden during the user's "seed/prompt trial-and-error" stage, this paper proposes a training-free low-resolution (LR) "preview" generation method. The goal of "perceptual consistency between LR and HR" is reformulated as a commutator-zero condition between the flow matching model and the downsampling operator. This condition is approximately satisfied during sampling via "optimal downsampling matrix selection" and "commutator-zeroing guidance," saving up to 33% of computation while preserving composition and color consistency. When combined with temporal acceleration, it achieves a 3.05× speedup.
- Transition Models: Rethinking the Generative Learning Objective
-
TiM generalizes the "infinitesimal step" PF-ODE supervision of diffusion models into a state transition identity that holds exactly for any time interval \(\Delta t\). This allows a compact 865M-parameter model to both perform 1-step generation and improve monotonically with more sampling steps, outperforming SD3.5 (8B) and FLUX.1 (12B) on GenEval despite its far smaller parameter count.
- Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances
-
This work upgrades UWB ranging between 6 IMUs from "extra features" to "geometric constraints." It first uses Multidimensional Scaling (MDS) to reconstruct 3D sensor layouts from pairwise distances as diffusion conditions, then employs forward kinematics during denoising sampling to align predicted poses with sensor distances via guidance, reducing joint position error in sparse inertial pose estimation by up to 22%.
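The MDS step mentioned above can be illustrated with classical (Torgerson) MDS. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `classical_mds`, the noise-free distances, and the random six-sensor layout are all hypothetical. It only shows that a 3D layout can be recovered from pairwise distances, up to rotation, reflection, and translation.

```python
import numpy as np

def classical_mds(dist, dim=3):
    """Recover point coordinates from a pairwise distance matrix.

    The result is only defined up to rotation, reflection, and translation.
    """
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centering the squared distances yields the Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]    # indices of the largest `dim`
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Hypothetical layout of 6 sensors (random positions, for illustration only).
rng = np.random.default_rng(0)
sensors = rng.normal(size=(6, 3))
dist = np.linalg.norm(sensors[:, None] - sensors[None, :], axis=-1)

layout = classical_mds(dist)
recovered = np.linalg.norm(layout[:, None] - layout[None, :], axis=-1)
print(np.abs(recovered - dist).max())  # reconstruction error of the distances
```

With real UWB measurements the distances are noisy, so the recovered layout would only approximate the true one.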
- Understanding, Accelerating, and Improving MeanFlow Training
-
This paper dissects the training dynamics of MeanFlow when simultaneously learning "instantaneous velocity \(v\) and average velocity \(u\)" through controlled experiments. It discovers that \(v\) must be established first, and that \(u\) supervision at small time intervals \(\Delta t\) is beneficial while large intervals are detrimental. Based on this, a training scheme featuring "accelerated \(v\) formation + progressive \(L_u\) weighting (prioritizing small intervals and gradually transitioning to interval balance)" is designed. Using the same DiT-XL backbone, it reduces the 1-NFE FID on ImageNet \(256 \times 256\) from 3.43 to 2.87 and achieves approximately 2.5× convergence acceleration.
- Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
-
Uni-DAD is proposed as the first method to unify diffusion model distillation and adaptation into a single-stage pipeline. By employing a dual-domain DMD loss and a multi-head GAN loss, it achieves high-quality and diverse generation in few-shot domains with only 1–4 sampling steps.
- UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
-
UniEdit-I utilizes the semantic latent space (CLIP features) of unified Vision-Language Models (VLMs) as an editable canvas and introduces an "Understanding-Editing-Verifying" (UEV) closed loop. By using the VLM to parse instructions, traverse FlowEdit trajectories in the CLIP space, and provide real-time feedback to dynamically adjust editing intensity or determine early stopping/retries, it achieves state-of-the-art open-source performance on GEdit-Bench, approaching GPT-4o, without any fine-tuning or structural modifications.
- Unified Customized Generation by Disentangled Reward Modeling
-
USO (Unified Simultaneous Optimization) unifies "subject-driven generation" and "style-driven generation" as complementary tasks within a single DiT model. By constructing cross-task triplet data via two expert models, followed by joint training with disentangled encoding, random condition dropout, and Auxiliary Style Reward (ASR), it achieves open-source SOTA performance in subject consistency, style similarity, and text controllability simultaneously.
- Unified Latent Space for Understanding and Generation via Semantic Auto-encoder
-
Addressing the fundamental trade-off where "semantic encoder latent spaces possess semantics but lose geometry, while reconstruction VAE latent spaces possess geometry but lack semantics," this paper utilizes a frozen DINOv3 as the encoder, combined with two-stage progressive training and a semantic regularization loss that aligns the student encoder with teacher features. The result is a unified latent space, the Semantic Auto-encoder (S-AE), which simultaneously supports high-fidelity reconstruction (rFID 0.06) and linear probing classification (ImageNet 81.9%).
- Unified Vector Floorplan Generation via Markup Representation
-
This paper proposes Floorplan Markup Language (FML) to encode floorplan elements such as rooms and doors into structured token sequences. A LLaMA-style Transformer model (FMLM) is employed to uniformly solve various floorplan generation tasks, including unconditional, boundary-conditioned, graph-conditioned, and completion tasks, achieving FID metrics over 80% lower than HouseDiffusion.
- UniGenDet: A Unified Generative-Discriminative Framework for Co-evolutionary Generation and Detection
-
UniGenDet integrates "faking" (image generation) and "fake-detecting" (AI-generated image detection) into a single unified multi-modal model for two-stage joint training. By employing Symbiotic Multimodal Self-Attention to inject the generator's understanding of image distributions into the detector, and using a frozen detector as an "authenticity teacher" to reversely align generator features, the framework facilitates mutual advancement in a closed loop. Consequently, detection accuracy (98.0% Acc on FakeClue) and generation fidelity (FID 22.9 \(\rightarrow\) 17.5) are simultaneously improved.
- UniPercept: A Unified Diffusion Model for Generalizable Visual Perception
-
UniPercept transforms a DiT diffusion model into a universal visual perception framework using a "shared foundation + lightweight adapters." The foundation learns general perception priors through joint training on 7 tasks (depth, normals, albedo, segmentation, etc.). New tasks can be efficiently adapted by training a small adapter (<1% parameters) with only 1,000 samples. Across 14 tasks, it mostly outperforms unified generative models and approaches the performance of task-specific models.
- UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization
-
UniVerse utilizes a unified "Reference Condition Extractor (RCE)" to simultaneously extract visual condition latent variables and text modulation offsets from unsegmented in-the-wild photos based on reference prompts. This achieves segmentation-free, disentangled, and composable multi-concept personalized generation on Diffusion Transformer, outperforming existing methods on XVerseBench and the newly proposed UniVerseBench.
- UniVerse: Empower Unified Generation with Reasoning and Knowledge
-
Addressing the issue where unified multimodal models "understand complex prompts but fail to generate correctly," this paper constructs UniVerse—a dataset of 120k samples consisting of "implicit prompt → reasoning chain → explicit prompt" triplets paired with ground-truth images. By proposing CoT injection training to explicitly integrate the reasoning process into the generation pipeline, the authors significantly and consistently improve the reasoning and knowledge-based generation of Bagel on WISE and R2I-Bench.
- VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
-
In autoregressive (AR) image generation, the training objectives for the tokenizer and the generator are disconnected (one learns pixel reconstruction, the other learns token likelihood). VA-π formulates their alignment as a variational objective (ELBO) and utilizes reinforcement learning to treat "whether decoding can reconstruct the original image" as a pixel-level reward to fine-tune the AR generator. Using only 1% of ImageNet data for 25 minutes, it reduces the FID of LlamaGen-XXL from 14.36 to 7.65.
- VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
-
To address slow sampling in Rectified Flow (RF) models, this paper proposes VDE: decomposing the predicted velocity at each step into parallel and orthogonal components relative to the current input. By leveraging the observation that "scalar coefficients are approximately linear over time and orthogonal directions remain nearly constant in the short term," the method uses linear extrapolation and direction reuse to estimate velocity directly from the current input during most steps, skipping model forward passes. This achieves 2.04–3.22× speedup on FLUX/Qwen-Image/Wan2.1 with almost no quality loss (e.g., LPIPS on Qwen-Image is reduced by 52.2% compared to the strongest baseline).
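The decomposition described above can be sketched in a few lines. This is an illustrative sketch only: the helper names (`decompose`, `extrapolate`, `estimate_velocity`), the choice of two scalar coefficients, and the two-step history are assumptions, not the paper's exact formulation.

```python
import numpy as np

def decompose(v, x):
    """Split v into a component parallel to x and an orthogonal remainder.

    Returns (a, b, d) such that v = a * x + b * d, where d is a unit
    direction orthogonal to x.
    """
    a = np.vdot(v, x) / np.vdot(x, x)   # coefficient of the parallel part
    orth = v - a * x
    b = np.linalg.norm(orth)            # magnitude of the orthogonal part
    d = orth / b if b > 0 else np.zeros_like(orth)
    return a, b, d

def extrapolate(prev, curr):
    """Linear extrapolation one step ahead, assuming equal step sizes."""
    return 2.0 * curr - prev

def estimate_velocity(x, history):
    """Estimate the velocity at x without calling the model.

    `history` holds the (a, b, d) triples of the two most recent real
    forward passes. The scalars are extrapolated; the direction is reused.
    """
    (a0, b0, _), (a1, b1, d1) = history
    return extrapolate(a0, a1) * x + extrapolate(b0, b1) * d1

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # stand-in for the current latent
v = rng.normal(size=(4, 8))   # stand-in for a model-predicted velocity
a, b, d = decompose(v, x)
reconstructed = a * x + b * d
```

In a sampler, `estimate_velocity` would replace the model call on the skipped steps, with real forward passes interleaved to refresh the history.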
- VecGlypher: Unified Vector Glyph Generation with Language Models
-
VecGlypher is proposed as the first unified language model for both text- and image-guided vector glyph generation. Through two-stage training (large-scale SVG syntax learning + expert label alignment), it directly generates editable SVG paths autoregressively without intermediate raster steps or vectorization post-processing.
- VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
-
VectorArk redesigns the "Raster to Vector (SVG)" task into a generative-model-friendly rounded polygon representation. Combined with outline-based input, vectorization-driven degradation training, and DINO-ranked test-time scaling, a multimodal LLM with only 1B parameters significantly outperforms StarVector and OmniSVG in geometric completeness and artifact removal on real-world tasks (including T2I outputs).
- VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
-
A frozen Vision Foundation Model (VFM, such as SigLIP2-Large) is directly utilized as the VAE encoder for Latent Diffusion Models (LDMs), paired with a dedicated multi-scale decoder to reconstruct semantic features into realistic images. This approach bypasses the representation degradation caused by "distillation alignment." Consequently, on ImageNet \(256 \times 256\), it achieves a gFID (without CFG) of 2.22 in only 80 epochs (approximately \(10\times\) faster than previous tokenizers) and further reaches 1.62 at 640 epochs.
- Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
-
This paper proposes the Vibe Blending task (fusing two images into a coherent hybrid based on their "most relevant shared attributes", the so-called "vibe") and the Vibe Space method. By using graph diffusion maps to learn a low-dimensional "small-world" manifold in the CLIP/DINO feature space, it transforms originally curved geodesics into linearly interpolatable paths, generating creative blends that are better recognized by humans than those from GPT or Gemini.
- VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
-
VibeToken proposes a "resolution-agnostic" 1D Transformer tokenizer that compresses images of arbitrary resolutions/aspect ratios into 32–256 dynamic-length discrete tokens. Paired with a constant-compute autoregressive generator, VibeToken-Gen, it generates 1024×1024 images (3.94 gFID) using only 64 tokens. The inference FLOPs are 63× lower than LlamaGen, flattening the AR generation compute curve from "quadratic growth with resolution" to a horizontal line.
- VideoCoF: Unified Video Editing with Temporal Reasoner
-
VideoCoF is proposed as a Chain-of-Thought inspired "see → reason → edit" video editing framework. By requiring the video diffusion model to first predict reasoning tokens (grayscale highlighted latents of the editing region) before generating target video tokens, it achieves precise instruction-region alignment without user-provided masks. It reaches SOTA performance after training on only 50K video pairs and supports video length extrapolation up to 16 times the training length.
- ViHOI: Human-Object Interaction Synthesis with Visual Priors
-
This paper proposes ViHOI, a plug-and-play framework that leverages VLMs to extract decoupled visual and textual priors from 2D reference images. These are compressed into compact conditional tokens via Q-Formers to enhance the HOI motion generation quality of diffusion models. During inference, it utilizes text-to-image models to synthesize reference images, achieving strong generalization to unseen objects.
- Vinedresser3D: Agentic Text-guided 3D Editing
-
Vinedresser3D is proposed as a 3D editing agent centered on Multimodal Large Language Models (MLLMs). It eliminates the need for user-provided 3D masks by automatically parsing editing intent, locating editing regions, and generating multimodal guidance. By executing inversion-based in-painting in the latent space of a native 3D generative model (Trellis), high-quality text-guided editing of 3D assets is achieved.
- VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
-
Addressing the frequent editing omissions of diffusion models in professional design tasks (where a single instruction contains 18–22 targets), this paper first constructs LGBench (2000 tasks, 29k annotated targets) to expose failures. It then proposes VisionDirector—a training-free "director-style" closed-loop controller: a VLM planner decomposes long instructions into structured goals, dynamically decides between single-shot generation or multi-stage editing, and performs micro-grid sampling with semantic validation/rollback at each step. Finally, GRPO is used to compress the planner's editing trajectory from 4.2 to 3.1 steps, achieving new SOTA results on GenEval (+7%) and ImgEdit (+0.07).
- ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
-
ViStoryBench constructs a comprehensive benchmark comprising 80 multi-style stories, 344 characters, and 1,317 shots. It proposes 12 automated evaluation metrics (covering character consistency, style similarity, prompt alignment, copy-paste detection, etc.) to systematically evaluate over 25 open-source and commercial story visualization methods, filling the gap of unified evaluation standards in the field.
- Visual Diffusion Models are Geometric Solvers
-
The authors discover that a standard visual diffusion model (U-Net denoiser) can directly approximate a set of NP-hard geometric problems (Inscribed Square, Steiner Minimum Tree, Maximum Area Simple Polygon) in pixel space. By representing geometric challenges as images and treating diffusion sampling as the process of "generating a valid solution from noise," three distinct problems share the same architecture, differing only in training data.
- VOSR: A Vision-Only Generative Model for Image Super-Resolution
-
This work proposes VOSR and is the first to demonstrate that a generative super-resolution (SR) model trained only on visual data can match or even surpass methods based on T2I pre-training. By utilizing visual semantic conditions and a restoration-oriented guidance strategy, VOSR achieves high-quality SR at only 1/10 of the training cost of T2I-based methods.
- WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
-
By analyzing the norm-direction decomposition of weight changes during distillation, the authors find that direction change is the primary driver of distillation (its magnitude is \(22\times\) larger than that of the norm change). They propose the LoRaD (Low-Rank weight Direction rotation) adapter and integrate it into the VSD framework to form WaDi, achieving SOTA one-step FID on COCO with only ~10% trainable parameters.
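One way to make a norm-direction decomposition of weight changes concrete is to compare, per weight row, the relative change in norm with the angle between the old and new directions. The sketch below is an assumption-laden illustration: the function name `norm_direction_change`, the row-wise granularity, and the two metrics are hypothetical and may differ from the measurement used in the paper.

```python
import numpy as np

def norm_direction_change(w_before, w_after):
    """Per-row relative norm change and direction change (angle in radians)."""
    norm_before = np.linalg.norm(w_before, axis=1)
    norm_after = np.linalg.norm(w_after, axis=1)
    rel_norm_change = np.abs(norm_after - norm_before) / norm_before
    cosine = np.sum(w_before * w_after, axis=1) / (norm_before * norm_after)
    angle = np.arccos(np.clip(cosine, -1.0, 1.0))
    return rel_norm_change, angle

w = np.array([[3.0, 4.0]])

# Pure rotation by 90 degrees: the norm is unchanged, the direction changes.
rot_norm, rot_angle = norm_direction_change(w, np.array([[-4.0, 3.0]]))

# Pure rescaling by 2: the norm changes, the direction is unchanged.
scale_norm, scale_angle = norm_direction_change(w, 2.0 * w)
```

Applied to a teacher checkpoint and its distilled student, statistics like these would indicate whether the update mostly rotates or mostly rescales the weights.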
- When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards
-
The authors discover that, for the same prompt, images generated by different Text-to-Image (T2I) models form tight, distinct clusters in image embedding spaces. Applying a zero-training, black-box, nearest-centroid classification method across 22 models and 280 prompts (150k images), they identify the "anonymous" source model with 91% top-1 accuracy, undermining the anonymity assumption essential for fair voting-based T2I leaderboards.
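Nearest-centroid classification itself is simple enough to sketch. The example below uses synthetic 16-dimensional vectors in place of real image embeddings, and three well-separated clusters in place of real T2I models. The helper names `fit_centroids` and `predict` are hypothetical, and the setup is an illustration rather than the paper's pipeline.

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Average the reference embeddings of each class (no training involved)."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(embeddings, classes, centroids):
    """Assign each embedding to the class with the closest centroid."""
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Synthetic stand-in: 3 "models", each producing a tight cluster of embeddings.
rng = np.random.default_rng(0)
centers = 10.0 * np.eye(3, 16)
model_ids = np.repeat(np.arange(3), 20)
reference = centers[model_ids] + 0.1 * rng.normal(size=(60, 16))
queries = centers[model_ids] + 0.1 * rng.normal(size=(60, 16))

classes, centroids = fit_centroids(reference, model_ids)
predicted = predict(queries, classes, centroids)
accuracy = float((predicted == model_ids).mean())
```

In the leaderboard setting, the reference embeddings would come from images generated by each candidate model for a given prompt, and the queries from the anonymized leaderboard images.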
- When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models
-
This paper points out that the quality of Latent Diffusion Models (LDM) depends on whether the VAE latent space simultaneously satisfies "local smoothness" and "global dispersion." It proposes SORL—a bottom-up training paradigm that utilizes two simple local rules, "local attraction" and "local repulsion," to allow these two global structures to emerge spontaneously, thereby simultaneously improving reconstruction fidelity and generation diversity.
- When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
-
The authors evaluate over ten open-source T2I diffusion models released between 2022 and 2025 as "synthetic training data generators." By training classifiers on synthetic images and evaluating them on real test sets, they discover a counter-intuitive trend: newer models with better visual quality and prompt following produce less useful data. The Synth→Real accuracy has consistently declined over time because newer models collapse the distribution onto a narrow "aesthetic center" manifold, losing texture and high-frequency details while sacrificing diversity.
- When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
-
This paper proposes Conflict-aware Adaptive Safety Guidance (CASG), a training-free plug-and-play framework that resolves safety degradation caused by directional conflicts when aggregating multiple categories. It dynamically identifies the harmful category most aligned with the current generation state and applies safety guidance only along that direction.
- WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
-
WiseEdit decomposes instructive image editing into a three-level cognitive process of "Awareness—Interpretation—Imagination" paired with three categories of knowledge ("Declarative/Procedural/Metacognitive"). It constructs a challenging benchmark of 1,220 Chinese-English bilingual cases, including 26% multi-image inputs. Using GPT-4o for scoring across five dimensions—including self-developed Knowledge Fidelity (KF) and Creative Fusion (CF)—the study systematically exposes the shortcomings of current SOTA editing models in knowledge reasoning and compositional creation.
- WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
-
This paper proposes WISER, a training-free Zero-Shot Composed Image Retrieval (ZS-CIR) framework. It unifies T2I and I2I dual-path retrieval through a "Retrieval–Verification–Refinement" iterative loop. By utilizing a VLM verifier to explicitly model intent-awareness and uncertainty-awareness, WISER achieves adaptive fusion and structured self-reflective refinement. It delivers relative improvements of 45% in CIRCO mAP@5 and 57% in CIRR Recall@1, outperforming many training-based methods.
- You Only Erase Once: Erasing Anything without Bringing Unexpected Content
-
YOEO employs a few-step diffusion erasure model trained on unpaired data (real images without "erased ground truth"). By utilizing "sundry detector + entity coherence" as two pair-free supervisory signals, it achieves clean object erasure in a single pass without generating unexpected content. Despite having only 860M parameters, it significantly outperforms 12B Flux-based methods in sundry-related metrics.
- Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models
-
This paper demonstrates that the widely used heuristic of linearly interpolating two latents according to a mask in the VAE latent space is mathematically incorrect. It proposes the "Pixel-Equivalent" principle for latent composition and introduces DecFormer, a lightweight 7.7M-parameter transformer that learns this equivalent operator. DecFormer reduces mask boundary errors by up to 53% with only approximately 3.5% FLOPs overhead, without modifying the diffusion backbone.
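A toy example shows why mask-based linear interpolation in latent space need not match compositing in pixel space. The 1D "decoder" below (upsample, blur, `tanh`) is a hypothetical stand-in for a VAE decoder, chosen only because it is nonlinear and mixes neighbouring positions. It is not the paper's derivation and does not involve DecFormer.

```python
import numpy as np

def toy_decoder(z):
    """Hypothetical decoder: 2x upsampling, 3-tap blur, pointwise tanh."""
    up = np.repeat(z, 2)
    padded = np.pad(up, 1, mode="edge")
    blurred = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
    return np.tanh(blurred)

z_a = np.full(8, 2.0)    # latent of image A
z_b = np.full(8, -1.0)   # latent of image B
latent_mask = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
pixel_mask = np.repeat(latent_mask, 2)  # the same mask at pixel resolution

# Heuristic: interpolate the latents by the mask, then decode.
latent_composite = toy_decoder(latent_mask * z_a + (1 - latent_mask) * z_b)

# Reference: decode each latent, then composite in pixel space.
pixel_composite = pixel_mask * toy_decoder(z_a) + (1 - pixel_mask) * toy_decoder(z_b)

boundary_error = np.abs(latent_composite - pixel_composite)
print(boundary_error.round(3))  # nonzero only next to the mask boundary
```

In this toy setting the two results agree away from the mask edge and disagree at the two pixels adjacent to it, which is consistent with the summary's focus on mask boundary errors.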