
🎨 Image Generation

📷 CVPR2025 · 305 paper notes

📌 Same area in other venues: 📷 CVPR2026 (492) · 🔬 ICLR2026 (353) · 💬 ACL2026 (5) · 🧪 ICML2026 (141) · 🤖 AAAI2026 (79) · 🧠 NeurIPS2025 (221)

🔥 Top topics: Diffusion Models ×116 · Text-to-Image ×27 · Personalized Generation ×18 · Alignment/RLHF ×13 · Adversarial Robustness ×13

3DTopia-XL: Scaling High-Quality 3D Asset Generation via Primitive Diffusion: This paper proposes 3DTopia-XL, a native 3D generation model based on a novel primitive representation PrimX and a Diffusion Transformer. It generates high-quality 3D assets with high-resolution geometry, texture, and PBR materials from text or image inputs, significantly outperforming existing methods in both quality and efficiency.
A Bias-Free Training Paradigm for More General AI-generated Image Detection: This work proposes the B-Free training paradigm—generating semantically aligned fake images from real images via self-conditioned reconstruction with Stable Diffusion, combined with inpainting-based content augmentation to eliminate format, content, and resolution biases. This allows the detector to focus on generator-specific artifacts, achieving a generalization $\text{AUC} > 99\%$ and a balanced accuracy of 95.2% across 27 generator models (including recent models like FLUX and SD 3.5).
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation: This paper systematically investigates the effectiveness of using decoder-only LLMs as text encoders for text-to-image diffusion models. The authors find that while directly using the last-layer embeddings yields worse results than T5, aggregating embeddings across all layers via layer-normalized averaging significantly outperforms the T5 baseline.
Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization: This paper proposes Step-by-step Preference Optimization (SPO), which samples multiple candidates from the same noisy latent at each denoising step and employs a step-aware preference model to select win/lose pairs to guide diffusion model fine-tuning. By implicitly distilling aesthetic information from generic preference data, SPO significantly improves aesthetic quality on SD-1.5 and SDXL, while achieving substantially faster convergence than DPO.
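
The layer-normalized averaging mentioned in the decoder-only LLM entry above can be illustrated with a minimal sketch. It assumes the hidden states of every layer are already available as nested lists, and that "layer-normalized averaging" means normalizing each layer's token embedding to zero mean and unit variance (without learned scale or shift) before taking the mean over layers. The function names `layer_norm` and `aggregate_layers` and the toy data are hypothetical; the paper's exact recipe may differ.

```python
import math
import random

def layer_norm(vec, eps=1e-6):
    """Normalize one embedding vector to zero mean and unit variance (no learned affine)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def aggregate_layers(hidden_states):
    """Average per-layer token embeddings after normalizing each one.

    hidden_states[l][t] is the embedding (list of floats) of token t at layer l.
    Returns one aggregated embedding per token.
    """
    num_layers = len(hidden_states)
    num_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    out = []
    for t in range(num_tokens):
        # Normalize first so that layers with large activations do not dominate the mean.
        normed = [layer_norm(hidden_states[l][t]) for l in range(num_layers)]
        out.append([sum(n[d] for n in normed) / num_layers for d in range(dim)])
    return out

# Toy stand-in for LLM hidden states (assumption): 4 layers, 3 tokens, 8 dims,
# with per-layer magnitudes that differ by orders of magnitude.
random.seed(0)
scales = [0.1, 1.0, 10.0, 100.0]
hidden = [[[random.gauss(0.0, s) for _ in range(8)] for _ in range(3)] for s in scales]
text_embedding = aggregate_layers(hidden)
print(len(text_embedding), len(text_embedding[0]))  # 3 8
```

Normalizing before averaging is the design point here: without it, the layer with the largest activation scale would effectively decide the aggregated embedding.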

HOI-IDiff: An Image-like Diffusion Method for Human-Object Interaction Detection

AniDoc: Animation Creation Made Easier

AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer: This paper proposes AniMer, which for the first time applies a high-capacity ViT backbone to quadrupedal SMAL parameter estimation. By separating the shape distributions of different species through supervised contrastive learning at the animal-family level, and by training on the ControlNet-based synthetic dataset CtrlAni3D (10k images), it comprehensively outperforms existing methods on Animal3D, CtrlAni3D, and the cross-domain Animal Kingdom dataset.
SPAI: Any-Resolution AI-Generated Image Detection by Spectral Learning: This work proposes SPAI, which models the frequency distribution of real images through Masked Spectral Learning. By introducing Spectral Reconstruction Similarity (SRS) and Spectral Context Attention (SCA), it detects AI-generated images as out-of-distribution (OOD) samples. SPAI achieves an average AUC of 91.0% across 13 generation models, an absolute improvement of 5.5% over the second-best method, while supporting detection of images with arbitrary resolutions.
Arbitrary-Steps Image Super-Resolution via Diffusion Inversion: This paper proposes InvSR, which achieves diffusion inversion by training a noise prediction network. Utilizing the image prior of a pre-trained diffusion model for super-resolution, it supports arbitrary-step sampling from 1 to 5 steps, achieving or exceeding the performance of existing state-of-the-art (SOTA) methods even with single-step sampling.
ArtiFade: Learning to Generate High-quality Subject from Blemished Images: This paper proposes ArtiFade, the first method to address the problem of "blemished subject-driven generation". By constructing paired blemished-unblemished datasets, partially fine-tuning the cross-attention weights of diffusion models, and optimizing an artifact-free embedding, it enables existing subject-driven methods (e.g., Textual Inversion, DreamBooth) to generate high-quality, artifact-free subject images from inputs containing blemishes such as watermarks, stickers, or adversarial noise.
AS-Bridge: A Bidirectional Generative Framework Bridging Next-Generation Astronomical Surveys: This work proposes AS-Bridge, a bidirectional Brownian Bridge diffusion model designed to model the stochastic mapping relationship between two major astronomical surveys: the ground-based LSST and the space-based Euclid. It achieves probabilistic cross-survey translation and rare event detection (strong gravitational lensing), and demonstrates that the epsilon-prediction training objective benefits both reconstruction quality and likelihood estimation.
AutoPresent: Designing Structured Visuals from Scratch: This paper proposes the AutoPresent framework and the SlidesBench benchmark to systematically study the task of generating presentation slides from natural language instructions for the first time. By leveraging LLMs to generate Python code (instead of end-to-end image generation) to create PPTX slides, combined with the SlidesLib utility library and iterative refinement, an 8B open-source model achieves performance close to GPT-4o.
Autoregressive Distillation of Diffusion Transformers: Proposes Autoregressive Distillation (ARD), which uses the history of the ODE trajectory, rather than only the current denoised sample, as input to predict future steps. By modifying the teacher transformer architecture with token-wise time embeddings and block-wise causal attention masks, it achieves an FID of 1.84 in 4 steps on ImageNet-256 with only 1.1% extra FLOPs.
AvatarArtist: Open-Domain 4D Avatarization: AvatarArtist is proposed, which cooperatively constructs a multi-domain image-triplane dataset using GAN and diffusion models to train a DiT for generating parametric triplanes, combined with a motion-aware cross-domain renderer to achieve drivable 4D avatars from a single portrait of arbitrary styles.
Beyond Convolution: A Taxonomy of Structured Operators for Learning-Based Image Processing: This paper systematically organizes alternative/extended operators of convolution in learning-based image processing into five major families (decomposition-based, adaptive weighted, basis-adaptive, integral/kernel-based, and attention-based), and provides a comparative analysis across multiple dimensions including linearity, locality, equivariance, computational cost, and task suitability.
Bias for Action: Video Implicit Neural Representations with Bias Modulation: Proposes ActINR, which achieves continuous video representation by sharing INR weights across frames and modeling motion solely through the biases. On 10× slow-motion, 4× spatial + 2× temporal super-resolution, video denoising, and inpainting tasks, it significantly outperforms existing methods (with average improvements of 3-6 dB).
BiGain: Unified Token Compression for Joint Generation and Classification: BiGain, for the first time, reformulates token compression in diffusion models as a dual-objective optimization problem for both generation and classification. It proposes two frequency-aware operators: Laplacian-Gated Token Merging (L-GTM) and Interpolation-Extrapolation KV Downsampling (IE-KVD). BiGain significantly improves classification accuracy while maintaining generation quality (Acc +7.15%, FID -0.34 under a 70% merging ratio on ImageNet-1K).
Boost Your Human Image Generation Model via Direct Preference Optimization: This paper proposes HG-DPO, which utilizes real human images as winning images (instead of generated image pairs) and designs a three-stage curriculum learning strategy (Easy/Normal/Hard) to progressively bridge the distribution gap between generated and real images. Combined with a statistics matching loss to resolve color shift, it reduces the FID from 37.34 to 29.41 (-21.4%), improves CI-Q from 0.906 to 0.934, and outperforms Diffusion-DPO with a 99.97% win rate.
BootPlace: Bootstrapped Object Placement with Detection Transformers: BootPlace is proposed to reformulate the object placement problem as "placement-by-detection". By training detection transformers on object-removed backgrounds to identify candidate regions, and then matching target objects to the optimal regions using negative-correlation semantic complementarity, it improves the top-5 IoU on Cityscapes by approximately 4× compared to the state-of-the-art.
BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training: Proposes BooW-VTON, which trains a virtual try-on diffusion model requiring no human parsing masks through high-quality pseudo-data construction, in-the-wild data augmentation, and try-on localization loss. It comprehensively outperforms existing methods across multiple benchmarks including VITON-HD, StreetVTON, and WildVTON.
CacheQuant: Comprehensively Accelerated Diffusion Models: CacheQuant is proposed, a training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching (temporal level) and quantization (structural level). It achieves 5.18× acceleration and 4× compression on Stable Diffusion, with a CLIP score loss of only 0.02.
Calibrated Multi-Preference Optimization for Aligning Diffusion Models: This paper proposes Calibrated Preference Optimization (CaPO), which unifies scores from different reward models into expected win-rates through win-rate calibration, and designs a frontier-based rejection sampling (FRS) strategy based on the Pareto frontier to handle conflicts between multiple reward signals, consistently outperforming DPO and IPO methods on SDXL and SD3-Medium.
CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model: CamFreeDiff is proposed to achieve 360° panorama generation from a single camera-free image by integrating a lightweight 3-DoF homography estimator into a multi-view diffusion framework. This reduces the FID from 42.4 (MVDiffusion) to 27.0 and generalizes to out-of-domain data without fine-tuning.
Can Generative Video Models Help Pose Estimation?: InterPose is proposed, which leverages pre-trained video generative models to "hallucinate" intermediate frames between two images with little or no overlap. Combined with a self-consistency score to select the best video, it consistently improves pose estimation accuracy across four datasets on top of DUSt3R.
Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes: Proposes a channel-wise noise scheduling method that lets a single diffusion model architecture support two inverse rendering modes through different noise schedules: accuracy-first (SDM, $T=4$) and diversity-first (PDM, $T=1000$). An implicit lighting representation (ILR) is also introduced to support pixel-wise environment map reasoning and real object insertion.
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting: This paper proposes ChatGen, which automates three tedious steps in text-to-image (T2I) generation: prompt engineering, model selection, and parameter configuration. Through a multi-stage evolutionary training strategy (ChatGen-Evo), it enables users to obtain high-quality generated images simply by describing their requirements through free-style chatting.
Classifier-Free Guidance inside the Attraction Basin May Cause Memorization: From a dynamical systems perspective, the concept of "Attraction Basin" is proposed to explain the memorization phenomenon in diffusion models. Applying CFG inside the attraction basin causes trajectories to converge to memorized training images. Detecting transition points to delay CFG activation (combined with Opposite Guidance, OG) mitigates memorization with zero extra computational overhead.
CleanDIFT: Diffusion Features without Noise: Proposes CleanDIFT, which enables diffusion models to extract high-quality semantic features directly from clean images through lightweight unsupervised fine-tuning (only 30 minutes on a single A100 GPU). This removes the need for noise addition and timestep tuning required by traditional methods, and significantly outperforms standard diffusion features on multiple tasks such as semantic correspondence, depth estimation, and segmentation.
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation: This work systematically reveals two types of bias in CLIP within multi-object scenarios: text encoders bias toward earlier-mentioned objects, and image encoders bias toward larger objects. It traces the origin of these biases to the statistical pattern in contrastive training data where larger objects tend to be mentioned first.
Co-Spy: Combining Semantic and Pixel Features to Detect Synthetic Images by AI: Co-Spy is proposed to fuse two complementary detection pathways: VAE reconstruction artifact features and CLIP semantic features. VAE artifacts generalize well across models but are vulnerable to JPEG compression, while CLIP semantics are robust to JPEG compression but generalize poorly. An adaptive regulator dynamically allocates weights to the two pathways based on the input, establishing a new SOTA across 22 generative models.
coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation: This paper proposes coDrawAgents, an interactive multi-agent dialogue framework consisting of four expert agents: Interpreter, Planner, Checker, and Painter. Through divide-and-conquer incremental layout planning, visual context-aware reasoning, and explicit error correction, it achieves 0.94 (SOTA) on GenEval and 85.17 (SOTA) on DPG-Bench.
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient: This work proposes CoDe (Collaborative Decoding), which decomposes the multi-scale inference of VAR into a collaborative workflow: a large model drafts (low-frequency small scales) and a small model refines (high-frequency large scales). It achieves a 1.7× speedup and a 50% memory reduction, with the FID only slightly increasing from 1.95 to 1.98.
Color Alignment in Diffusion: This work proposes a color alignment diffusion method. By projecting intermediate samples or predictions into a conditional color space (using nearest-neighbor color mapping), the diffusion model strictly follows a given color distribution (color values and proportions) while retaining structural generation freedom. It supports three settings: retraining, fine-tuning, and zero-shot.
Community Forensics: Using Thousands of Generators to Train Fake Image Detectors: This work constructs the Community Forensics dataset containing 4,803 generative models and 2.7 million images. It reveals that scaling up the number of models, even those with similar architectures, significantly enhances the generalization of fake image detection, achieving a state-of-the-art average mAP of 0.966 across multiple benchmarks.
Compass Control: Multi Object Orientation Control for Text-to-Image Generation: Proposes Compass Control, which achieves precise 3D orientation control for multiple objects in text-to-image diffusion models by introducing a lightweight orientation encoder to predict compass tokens combined with a Coupled Attention Localization (CALL) mechanism. Requiring only synthetic data for training, it generalizes to unseen categories and multi-object scenes.
Composing Parts for Expressive Object Generation: Proposes PartComposer, a training-free method that localizes object parts from attention maps via parallel "part diffusion" and uses regional diffusion to independently generate user-specified fine-grained attributes (color, style, description) for each part, achieving part-level controllable image synthesis.
Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization: A unified framework for human relighting and background harmonization based on pre-trained diffusion models is proposed, which achieves illumination-consistent relighting for both static and video scenes using a coarse-to-fine strategy (spherical harmonics ControlNet providing coarse lighting + diffusion model learning fine residuals) and an unsupervised motion ControlNet.
Concept Lancet: Image Editing with Compositional Representation Transplant: Proposes Concept Lancet (CoLan), a zero-shot plug-and-play image editing framework. By sparsely decomposing the latent representation of the source image into a linear combination of visual concept vectors and then performing customized concept transplantation according to the editing task (replacement/addition/removal), CoLan resolves the challenge of editing intensity calibration.
Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization: Proposes Concept Replacer, which precisely identifies sensitive concept regions during the denoising process through a few-shot-trained concept localizer, and then replaces the localized region with safe content using training-free Dual-Prompt Cross-Attention (DPCA). This achieves precise local concept replacement instead of global image distortion.
ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation: ConceptGuard is proposed to mitigate catastrophic forgetting and concept confusion in continual personalized T2I generation through four strategies: shift embeddings, concept-binding prompts, memory-preserving regularization, and priority queue replay. It significantly outperforms existing methods on multi-concept benchmarks.
Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation: Analyzing the differences in style and structure sensitivity across self-attention layers of SDXL reveals that injecting conditional information only into the most sensitive subset of layers significantly improves the style-content trade-off in multi-conditional generation without extra training.
Consistent and Controllable Image Animation with Motion Diffusion Models: This paper proposes Cinemo, an image animation method based on diffusion models. By learning the distribution of motion residuals (rather than directly predicting frames), it significantly improves temporal consistency with the input image. Combined with SSIM motion intensity control and DCT noise initialization, it achieves finely controllable I2V generation, comprehensively outperforming existing methods on UCF-101 and MSR-VTT.
BootComp: Controllable Human Image Generation with Personalized Multi-Garments: This paper proposes BootComp, which trains a decomposition network to extract product-view garment images from human images to construct a large-scale synthetic paired dataset. It then trains a dual-path diffusion model to generate controllable human images conditioned on multiple reference garments, achieving a 30% improvement in MP-LPIPS over the state-of-the-art (SOTA).
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning: CTRL-O introduces language controllability into object-centric representation learning. By using language embedding to initialize slot queries, conditioning the decoder on language, and employing a control contrastive loss, it achieves language-object binding without mask supervision. It achieves an FG-ARI of 47.5 on COCO (+7.0 over Dinosaur), while supporting zero-shot referring expression segmentation, instance-level image generation, and VQA.
Curriculum Direct Preference Optimization for Diffusion and Consistency Models: This work introduces curriculum learning into DPO for the first time and adapts DPO to consistency models. By progressively training from "easy-to-distinguish preference pairs" to "hard-to-distinguish preference pairs", it comprehensively outperforms standard DPO and DDPO in text alignment, aesthetics, and human preferences, while requiring only 1/10 of the training data.
CustAny: Customizing Anything from A Single Example: This paper constructs the first large-scale generic object customization dataset MC-IDC (315K images, 10K+ categories) and proposes the CustAny framework. By utilizing multi-model ID extraction, global-local dual-level ID injection, and an ID-aware decoupling module, CustAny achieves zero-shot customized generation of arbitrary objects from a single reference image.
Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales: This paper proposes to perform 8-bit quantization on the entire Winograd convolution pipeline using group-wise quantization, and resolves the issue of large dynamic ranges in the output transform via data-free fine-tuning of the scaling parameters of the Winograd transform matrix. It achieves near-lossless image generation quality and a 31.3% speedup of convolutions on diffusion models.
Decentralized Diffusion Models: Decentralized Diffusion Models (DDM) proposes a method to distribute diffusion model training across completely isolated compute clusters. By independently training expert models on data partitions and integrating them using a lightweight router during inference, the authors prove that this ensemble precisely optimizes the same global Flow Matching objective as a single model, outperforming a single large model on a FLOP-for-FLOP basis.
DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image: This work proposes DeClotH, which reconstructs decomposable 3D clothing and human body meshes from a single image. It leverages 3D templates (SMPLicit+SMPL) as geometric priors to mitigate occlusion issues and trains a dedicated ClothDiffusion model to provide garment-specific texture and geometric guidance.
Decouple-Then-Merge: Finetune Diffusion Models as Multi-Task Learning: This paper views diffusion model training as a multi-task learning problem and proposes the Decouple-then-Merge (DeMe) framework. By first group-finetuning multiple specialized models across different timesteps to eliminate gradient conflicts, and then merging them back into a single model in the parameter space, DeMe significantly improves generation quality without introducing extra inference overhead.
Decoupling Training-Free Guided Diffusion by ADMM: This paper proposes ADMMDiff, which decouples "unconditional generation" and "conditional guidance" in training-free conditional diffusion generation into two independent subproblems using the Alternating Direction Method of Multipliers (ADMM). This automatically balances the two without manual tuning of weight hyperparameters, outperforming existing methods across various conditional generation tasks.
Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI: This paper proposes FreeMCG, which utilizes diffusion models to generate particle sets on the data manifold and combines them with the Ensemble Kalman Filter (EnKF) to approximate the projection of model gradients onto the manifold. FreeMCG represents the first framework to unify feature attribution and counterfactual explanations while requiring only black-box model access.
Detecting Adversarial Data Using Perturbation Forgery: This work models adversarial noise as a Gaussian distribution and proves its proximity relation. Building on this, the proposed Perturbation Forgery method continuously perturbs the noise distribution during training to form an open cover, and combines it with sparse masks to generate pseudo-adversarial data for training a binary classifier. Using only the noise distribution from a single attack (FGSM), the detector generalizes to various unseen attacks, including gradient-based, GAN-based, diffusion-based, and physical attacks, achieving an AUROC of 0.99+ with extremely low inference overhead.
DiC: Rethinking Conv3x3 Designs in Diffusion Models: This paper revisits the potential of 3x3 convolutions in diffusion models. By introducing a series of architectural improvements (hourglass U-Net and sparse skip connections) and conditioning enhancements (stage-specific embeddings, mid-block injection, and condition gating), the authors build a pure 3x3 convolutional diffusion model, DiC. It outperforms DiT of comparable scale on ImageNet generation while achieving significantly faster inference.
Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment: Proposes the Diff2Flow framework, which achieves efficient knowledge transfer from pre-trained diffusion models to Flow Matching models through timestep rescaling, interpolation alignment, and velocity field derivation. This achieves performance superior to or on par with the SOTA across multiple tasks such as text-to-image generation and depth estimation with minimal fine-tuning overhead.
DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models: By automatically creating the largest 3D synthetic hair dataset to date (40K styles), this paper trains a diffusion-Transformer-based scalp texture generation model. It is the first to directly predict the latent texture map of individual hair strands (rather than guide strands) conditioned on an image, achieving diverse 3D hair reconstruction including afros and bald patterns from a single image.
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation: This paper introduces a new task, "Customized Manga Generation," and proposes the DiffSensei framework, which uses an MLLM as a text-compatible identity adapter to bridge diffusion models. By utilizing masked cross-attention for precise layout control, DiffSensei significantly outperforms existing methods on the self-constructed large-scale MangaZero dataset (43K pages / 427K annotated panels).
Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models: This paper proposes the Diffusion-4K framework, which consists of the Aesthetic-4K benchmark dataset, GLCM Score/Compression Ratio evaluation metrics, and a wavelet-based fine-tuning method. It enables large-scale latent diffusion models (LDMs) such as SD3-2B and Flux-12B to directly generate high-quality 4096×4096 images with rich texture details.
Diffusion Self-Distillation for Zero-Shot Customized Image Generation: This paper proposes Diffusion Self-Distillation (DSD), which leverages the emergent capability of pre-trained T2I models to generate consistent grid images in order to automatically construct identity-preserving paired datasets (using LLMs for prompt generation and VLMs for filtering). The same model is then fine-tuned to achieve zero-shot identity-preserving image generation, delivering results close to DreamBooth without requiring test-time optimization.
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention: DiG introduces Gated Linear Attention (GLA) into the backbone of diffusion models, addressing the unidirectional modeling and lack of local awareness in GLA through a Spatial Redirection and Enhancement Module (SREM). This achieves performance surpassing DiT on the ImageNet 256×256 generation task, while offering a 2.5× speedup and a 75.7% reduction in GPU memory at a resolution of 1792.
Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability: This paper proposes the DiffLens framework, which disentangles the internal neurons of diffusion models into a monosemantic feature space using sparse autoencoders (k-SAEs), and then localizes specific features driving bias generation through gradient-based attribution. This enables fine-grained control and mitigation of social biases such as gender and race while maintaining image quality.
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression: DiT-IC adapts a pre-trained T2I diffusion Transformer into a one-step image compression reconstruction model. Operating in a 32x downsampled deep latent space, it achieves SOTA perceptual quality through three alignment mechanisms—variance-guided reconstruction flow, self-distillation alignment, and latent conditional guidance—while decoding 30x faster than existing diffusion codecs.
DiverseFlow: Sample-Efficient Diverse Mode Coverage in Flows: This paper proposes DiverseFlow, a training-free inference-time method that introduces inter-sample coupled gradient constraints during the ODE solving process of flow models via Determinantal Point Processes (DPP), significantly improving the diversity and mode coverage of generated samples under a fixed sampling budget.
Divide and Conquer: Heterogeneous Noise Integration for Diffusion-based Adversarial Purification: A heterogeneous noise diffusion purification strategy based on attention masks is proposed. It applies high-intensity noise to crucial pixels focused on by the classifier to eradicate adversarial perturbations while applying low-intensity noise to the remaining regions to preserve semantic information, significantly reducing computational overhead via single-step resampling.
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation: This paper proposes Divot, a continuous video tokenizer that utilizes a diffusion process for self-supervised video representation learning. By training representations through a diffusion model conditioned on tokenizer features for denoising, and using a Gaussian Mixture Model (GMM) to model the continuous video feature distribution output by the LLM, a unified framework for video comprehension and generation is achieved.
DNF: Unconditional 4D Generation with Dictionary-Based Neural Fields: DNF proposes a 4D neural field representation based on dictionary learning. It achieves decoupled and compact encoding of shape and motion through an SVD decomposition-compression-expansion MLP parameter dictionary. Combined with a Transformer diffusion model, it enables unconditional 4D deforming object generation, achieving state-of-the-art (SOTA) performance on DeformingThings4D.
Do Visual Imaginations Improve Vision-and-Language Navigation Agents?: This paper uses SDXL to generate synthetic images as "imaginations" for visual landmarks in VLN instructions. These are encoded via ViT and concatenated with text instruction embeddings before being input into the VLN agent. Guided by a cosine similarity alignment loss, this approach consistently improves navigation success rates by approximately 1% on both R2R and REVERIE, verifying the preliminary value of visual imagination as a bridge between language and vision.
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles: DoraCycle is proposed to achieve unpaired domain adaptation of unified multimodal generative models using two multimodal cycles (text $\to$ image $\to$ text and image $\to$ text $\to$ image). Utilizing only unpaired target domain data, it achieves performance close to fully paired training (FID 27.44 vs 24.93), and is virtually lossless with 10% paired + 90% unpaired data (FID 25.37).
DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching: DreamCache is proposed to achieve tuning-free, encoder-free, and plug-and-play personalized image generation by caching the intermediate U-Net features of the reference image at a single denoising step ($t=1$) and injecting these cached features during generation using a lightweight 25M-parameter conditional adapter.
DreamOmni: Unified Image Generation and Editing: This work builds a unified 2.5B DiT model for text-to-image (T2I) generation and various editing tasks (instruction-based editing, inpainting, drag editing, and reference-based generation). It replaces the text encoder with Qwen2-VL to achieve unified vision-language prompt understanding and constructs a synthetic sticker data pipeline to efficiently create editing training data, achieving SOTA results in both generation and editing.
DreamRelation: Bridging Customization and Relation Generation: DreamRelation proposes a relation-aware customized image generation framework. Through three key designs—a decoupled data engine, Keypoint Matching Loss (KML), and local token injection—it maintains the identity consistency of multiple subjects while accurately generating textual relations (such as hugging, riding, etc.) between them, outperforming existing methods across the board on RelationBench.
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning: DreamVideo-Omni combines multi-subject customization with omni-motion control (global bbox + local trajectory + camera motion) within a unified DiT framework, trained through a progressive two-stage paradigm (Omni-Motion SFT + Latent Identity Reward Feedback Learning).
DualAnoDiff: Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation: DualAnoDiff is proposed to simultaneously generate high-quality anomaly image-mask pairs via a dual-interrelated diffusion model (a global branch generating the entire anomaly image + an anomaly branch generating the local anomalous part). A background compensation module is introduced to maintain the consistency of the background and object shape, which significantly improves downstream performance in anomaly detection, localization, and classification.
Dual Diffusion for Unified Image Generation and Understanding: The Dual Diffusion Transformer (D-DiT) is proposed, which simultaneously uses continuous diffusion to model image distribution and discrete masked diffusion to model text distribution within a single MM-DiT architecture. It represents the first end-to-end fully diffusion-based multimodal model, supporting a comprehensive suite of tasks including image generation, image captioning, and visual question answering.
Dual Diffusion for Unified Image Generation and Understanding: Proposes D-DiT (Dual Diffusion Transformer), the first fully end-to-end multimodal diffusion model, which employs continuous flow matching for the image branch and discrete masked diffusion for the text branch to simultaneously train both image generation and text understanding under a unified loss function.
Dual Prompting Image Restoration with Diffusion Transformers (DPIR): Proposes DPIR, an image restoration model based on SD3 (Diffusion Transformer). By incorporating a lightweight low-quality image condition branch and a global-local visual dual prompting branch, it introduces degraded-image information from multiple perspectives, systematically applying DiT to image restoration for the first time and achieving SOTA performance.
DualAnoDiff: Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation: DualAnoDiff is proposed, which leverages a dual-interrelated diffusion model to simultaneously generate the integrated anomaly image and the corresponding anomaly parts. This solves the issues of insufficient diversity, unnatural blending, and misaligned masks in few-shot anomaly image generation, achieving SOTA performance in downstream anomaly detection tasks.
Dynamic Motion Blending for Versatile Motion Editing (MotionReFit): MotionReFit proposes the first versatile text-guided motion editing framework. By dynamically generating training triplets via the MotionCutMix data augmentation technique, combined with an autoregressive diffusion model and a body part coordinator, it achieves spatial and temporal editing encompassing body part replacement, style transfer, and fine-grained adjustment.
EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting: EasyCraft proposes an end-to-end automatic avatar crafting framework. It maps facial images of arbitrary styles to a unified feature distribution using a general ViT encoder pre-trained with MAE, and the resulting features are then converted into avatar customization parameters for game engines. Meanwhile, it integrates text-to-image technology to support text input, allowing easy adaptation to different game engines.
EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting: Proposes EasyCraft, an end-to-end automatic avatar customization framework that translates photos of arbitrary styles into game crafting parameters using a self-supervised pre-trained general ViT encoder, and supports text-driven character creation by integrating Stable Diffusion.
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation: EDEN is proposed to comprehensively enhance the role of diffusion models in video frame interpolation from three dimensions: input representation, model architecture, and training paradigm. By compressing intermediate frames into semantically rich 1D token representations using a Transformer tokenizer, replacing the U-Net architecture with DiT, and introducing a dual-stream context integration mechanism (temporal attention + frame difference embedding), EDEN reduces LPIPS by nearly 10% on large-motion benchmarks like DAVIS, achieving high-quality generation with only 2 denoising steps.
Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking: A unified theoretical and empirical analysis of how diffusion-based image editing "unintentionally" destroys robust invisible watermarks: forward noising decays the watermark SNR exponentially, and the manifold contraction effect of reverse denoising eliminates the watermark signal as an "unnatural residual." Even state-of-the-art watermarks like VINE drop to near random guessing (~60% bit accuracy) under strong editing ($t^*=0.8$).
Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models: A bilevel optimization framework is proposed to unify the fine-tuning recovery (lower-level: distillation + diffusion loss minimization) and undesirable concept suppression (upper-level: guiding the model away from target concepts) of pruned diffusion models into a single-stage optimization. This addresses the cyclic dependency issue in two-stage "fine-tune then erase" methods, where the optimal fine-tuning point does not equate to the optimal initialization for erasing, achieving a 27% reduction in the CSD metric for style removal.
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction: This paper proposes CoordTok, a scalable video tokenizer that encodes video into a factorized triplane representation. The decoder learns the mapping from randomly sampled $(x,y,t)$ coordinates to the corresponding patch pixels (rather than reconstructing all frames at once). This design enables direct training of a large tokenizer on 128-frame long videos, compressing a 128-frame video into only 1280 tokens (compared to 6144-8192 tokens required by baselines), and driving a DiT to achieve one-shot 128-frame video generation (with a SOTA FVD of 369.3).
Efficient Personalization of Quantized Diffusion Model without Backpropagation (ZOODiP): This paper proposes ZOODiP, which achieves personalization (Textual Inversion) on quantized diffusion models via zeroth-order optimization. By utilizing subspace gradient projection to denoise the estimated gradient and partial uniform timestep sampling to accelerate training, it achieves personalization quality comparable to gradient-based methods using only forward passes and 2.37GB of VRAM, reducing memory usage by up to 8.2x.
EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing: This paper proposes EmoDubber, an emotion-controllable movie dubbing framework. By aligning lip movements and prosody using duration-level contrastive learning, enhancing speech clarity via a pronunciation enhancement strategy, and controlling emotion categories and intensity through a flow matching-based positive-negative guidance mechanism, EmoDubber comprehensively outperforms existing methods in lip-sync and pronunciation clarity.
EmoEdit: Evoking Emotions through Image Manipulation: This paper proposes EmoEdit, the first image manipulation framework that evokes specified emotions through content modification (rather than just color/style adjustments). It constructs the EmoEditSet dataset with 40,120 pairs, designs a plug-and-play Emotion Adapter, and achieves a strong balance between structural preservation and emotional evocation.
Enhancing Creative Generation on Stable Diffusion-based Models: This paper proposes C3 (Creative Concept Catalyst), a training-free method that enhances creative generation capabilities of Stable Diffusion by selectively amplifying features during the denoising process, and provides selection guidelines for amplification factors based on two primary dimensions of creativity.
Enhancing Facial Privacy Protection via Weakening Diffusion Purification: This paper weakens the purification effect in the reverse diffusion process of LDMs by learning timestep-wise unconditional embeddings, and leverages self-attention map guidance to maintain structural consistency, achieving an average PSR of 79.17% on CelebA-HQ and LADN, while outperforming all competing methods in FID.
Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception: This paper proposes DIAE, which translates vague aesthetic instructions into multimodal control signals (HSV/contour maps + text) via a Multimodal Aesthetic Perception (MAP) module. It constructs an "imperfectly-paired" dataset, IIAEData, and uses a dual-branch supervision strategy to achieve weakly-supervised aesthetic enhancement, achieving SOTA performance on LAION and MLLM aesthetic scores.
Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models: This paper proposes the PRSS method, which achieves the optimal privacy-utility trade-off in mitigating diffusion model memorization without modifying the model or requiring training data during inference. It accomplishes this by improving the CFG formulation through two strategies: Prompt Re-anchoring (reusing the memorized prompt as a CFG anchor to guide generation away from memorized content) and Semantic Prompt Search (using an LLM to search for semantically similar alternative prompts that do not trigger memorization).
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data (SPARCL): This paper proposes SPARCL, which generates high-fidelity, fine-grained counterfactual synthetic images by injecting real image features into the padding embeddings of a fast T2I model. It also designs an adaptive margin loss to filter noisy synthetic samples and focus on learning hard samples, improving CLIP's compositional understanding accuracy by over 8% on average across four benchmarks, outperforming the state-of-the-art by 2% on three of them.
Erasing Undesirable Influence in Diffusion Models (EraseDiff): This paper proposes EraseDiff, which formalizes the data unlearning problem in diffusion models as a constrained optimization problem based on a value function. It optimizes both model preservation and erasing effectiveness simultaneously using a natural first-order algorithm. It runs 11x faster than SA and 2x faster than SalUn on DDPM/Stable Diffusion, while achieving Pareto-optimality in the preservation-forgetting trade-off.
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment: This paper reveals the implicit domain misalignment issue between the source domain and the synthetic domain in diffusion-driven TTA methods, and proposes the Synthetic-Domain Alignment (SDA) framework. By utilizing the Mix of Diffusion (MoD) technique to simultaneously align the source model and the target data to the same synthetic domain, the proposed method achieves consistent performance improvements across classification, segmentation, and multimodal large language models.
EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation: EvoTok proposes a unified image tokenizer based on Residual Latent Evolution. By cascading residual vector quantization in a shared latent space, it allows representations to progressively evolve from shallow pixel-level details to deep semantic-level abstractions. Trained on only 13M images, it achieves a reconstruction quality of 0.43 rFID and delivers outstanding performance across 7/9 understanding benchmarks, GenEval, and GenAI-Bench.
Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis: This paper proposes Aurora, a text-to-image GAN model based on Sparse Mixture of Experts (Sparse MoE). By incorporating multiple expert networks and a text-aware sparse router in the generator to scale up model capacity, Aurora achieves a zero-shot FID of 6.2 on MS COCO at 64×64 resolution while maintaining an inference speed significantly faster than diffusion models.
FADE: Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models: Proposes FADE (Fine-grained Attenuation for Diffusion Erasure), the first method to address the adjacency issue of concept erasure in text-to-image diffusion models: it precisely erases the target concept while preserving the generation capability of semantically adjacent concepts, improving preservation performance by at least 12% compared to SOTA.
FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-Resolution: This paper proposes FaithDiff, which unleashes (fine-tunes) pre-trained diffusion model priors for image super-resolution for the first time. It designs an alignment module to bridge degraded image features and diffusion latent noise space, achieving faithful structural restoration through joint optimization of the encoder and diffusion model.
FDeID-Toolbox: Face De-Identification Toolbox: This paper proposes FDeID-Toolbox, a comprehensive toolbox oriented towards face de-identification (FDeID) research. By unifying four core components—data loading, method implementation, inference pipeline, and evaluation protocols—through a modular architecture, it addresses the long-standing pain points of fragmented implementations, inconsistent evaluation standards, and incomparable results in this field.
FilmComposer: LLM-Driven Music Production for Silent Film Clips: FilmComposer is the first to combine a large language model multi-agent system with waveform/symbolic music generation, simulating the workflow of professional musicians (spotting $\rightarrow$ composing $\rightarrow$ arranging $\rightarrow$ mixing) to automatically generate high-quality (48kHz), highly musical, and progressive film soundtracks from silent film clips.
FilmComposer: LLM-Driven Music Production for Silent Film Clips: Proposes FilmComposer, which simulates the workflow of professional musicians. Through three major modules—visual processing, rhythm-controllable MusicGen, and multi-agent arranging/mixing—it achieves the first automatic generation of high-quality professional soundtracks for silent film clips.
Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models: FADE proposes an adjacency-aware fine-grained concept erasure framework. By leveraging a Concept Neighborhood to identify semantically proximate categories and designing Mesh Modules with a triple loss (Erasing + Adjacency + Guidance), FADE precisely erases the target concept while preserving the generation capabilities of semantically related concepts. It achieves at least a 12% improvement in adjacency preservation compared to SOTA methods.
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs: This paper proposes FineLIP, which supports long text inputs up to 248 tokens via position embedding stretching and introduces adaptive token refinement alongside cross-modal token-level alignment, significantly outperforming state-of-the-art methods in long-description text retrieval and text-to-image generation tasks.
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs: FineLIP enables CLIP models to handle long text descriptions and perform fine-grained visual-textual matching through positional embedding stretching (77 to 248 tokens), an Adaptive Token Refinement Module (ATRM), and a Cross-modal Late Interaction Module (CLIM). It significantly outperforms existing methods such as Long-CLIP and TULIP on long-description retrieval tasks.
Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models: This paper proposes Finite Difference Flow Optimization (FDFO), an online RL variant. By sampling paired trajectories and shifting flow velocities towards those that generate superior images, FDFO optimizes diffusion/flow-matching T2I models. Treating the entire sampling process as a single action, FDFO achieves faster convergence, higher output quality, and better prompt alignment compared to existing RL post-training methods.
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations: FlipSketch is the first to achieve unconstrained raster sketch animation generation from a single static sketch and text prompt. Through three key innovations—LoRA fine-tuning on a T2V diffusion model, a DDIM-inversion-based reference frame mechanism, and a dual-attention composition—it generates smooth, dynamic animation sequences while maintaining sketch identity.
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations: This work proposes FlipSketch, the first system that generates unconstrained raster sketch animations from a single static sketch and a text description, achieving smooth animation through three key innovations: fine-tuning a text-to-video diffusion model, iterative reference frame alignment, and dual-attention composition.
Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation: Proposes Focus-N-Fix, a region-aware fine-tuning method for T2I models. By localizing problematic regions and constraining non-problematic regions to remain unchanged, it achieves precise correction of local quality issues (such as artifacts, over-sexualization, and violence) while avoiding catastrophic forgetting and reward hacking induced by global fine-tuning.
Font-Agent: Enhancing Font Understanding with Large Language Models: Constructs a large-scale multimodal dataset DFD containing 135,000 font-text pairs, and proposes Font-Agent—a vision-language model-based agent for font understanding. It captures font stroke details via an Edge-Aware Traces (EAT) module and refines the model's understanding of font styles through a Dynamic Direct Preference Optimization (D-DPO) strategy.
FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation: This paper proposes FoundHand, a domain-specific diffusion model trained on a tens-of-millions-scale hand dataset (FoundHand-10M). By employing 2D keypoint heatmaps as a universal control representation, FoundHand achieves precise control over hand poses/viewpoints and preserves identity appearance, demonstrating zero-shot emergent capabilities such as correcting deformed hands, video generation, and hand-object interaction videos.
Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems: Proves that the DDIM deterministic reverse chain is a Partitioned Iterated Function System (PIFS), deriving three computationally accessible geometric quantities (contraction threshold $L_t^*$, expansion function $f_t(\lambda)$, and global expansion threshold $\lambda^{**}$) that require no model evaluation. Based on this, it theoretically explains four existing empirical design choices (cosine offset, resolution logSNR shift, Min-SNR weighting, and Align Your Steps).
Free-viewpoint Human Animation with Pose-correlated Reference Selection: Proposes a pose-correlated reference selection diffusion network that computes target-reference pose correlation maps via a pose-correlation module to adaptively select the most relevant reference features. It supports high-quality human animation generation under dramatic viewpoint changes (including camera zoom) and introduces the MSTed multi-camera TED video dataset.
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition: LaDeCo introduces the layered design principles of graphic design into Large Multimodal Models (LMMs). It first uses GPT-4o to perform semantic layer planning for multimodal design elements, then progressively predicts element attributes layer by layer, rendering intermediate results to provide feedback to the model. This decomposes the complex design composition task into manageable sub-steps, significantly outperforming baseline methods in design composition quality.
From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing: This paper defines the text-to-diagram generation task, constructs DiagramGenBenchmark (covering 8 categories of diagrams), and proposes a multi-agent framework called DiagramAgent (Plan + Code + Check + Diagram-to-Code), which significantly outperforms existing text-to-image/code methods on diagram generation, coding, and editing tasks.
GCC: Generative Color Constancy via Diffusing a Color Checker: GCC leverages the image priors of pre-trained diffusion models to estimate illuminant color by generating a color checker reflecting scene illumination via inpainting. Incorporating Laplacian decomposition to preserve structural details while adapting to illumination variations, it demonstrates superior generalization capabilities in cross-camera scenarios.
GenDeg: Diffusion-based Degradation Synthesis for Generalizable All-In-One Image Restoration: This paper proposes GenDeg, a degradation synthesis framework based on Stable Diffusion, which can generate various controllable degradations (haze/rain/snow/motion blur/low-light/raindrops) on arbitrary clean images. It synthesizes over 550k images to construct the GenDS dataset, and All-In-One restoration models trained on this dataset achieve significant performance improvements on out-of-domain test sets.
Generation of Maximal Snake Polyominoes Using a Deep Neural Network: Applies DDPM to generate maximal snake polyominoes, proposing a streamlined Structured Pixel Space Diffusion (SPS Diffusion). Despite being trained only up to $14 \times 14$ square grids, it generalizes to $28 \times 28$ and generates valid snakes, with some results surpassing known lower bounds of maximum length.
Generative Image Layer Decomposition with Visual Effects: LayerDecomp proposes an image layer decomposition framework based on Diffusion Transformers. It decomposes an input image into a clean RGB background layer and an RGBA foreground layer containing transparent visual effects (shadows, reflections). Utilizing a consistency loss, the model learns accurate foreground representations even from unlabeled data, significantly outperforming existing object-removal and spatial-editing methods.
Generative Modeling of Class Probability for Multi-Modal Representation Learning: CALM proposes a generative multimodal representation learning method based on class-anchor alignment. By introducing class labels from an independent dataset as anchors to bridge the modality gap between video and text, it models uncertainty using a cross-modal probabilistic variational autoencoder. This approach significantly outperforms existing methods across four benchmarks, particularly in out-of-domain evaluations.
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens: DDT-LLaMA proposes using diffusion timestep encoding to learn discrete visual tokens (DDT) with a recursive structure. This endows visual token sequences with hierarchical dependencies similar to natural language, thereby achieving SOTA performance in both multimodal understanding and generation under a unified next-token-prediction framework.
Generative Photomontage: The paper proposes the Generative Photomontage framework, allowing users to select different regions from multiple images generated by ControlNet. It achieves seamless blending through multi-label graph cut segmentation in the diffusion feature space and self-attention feature injection, enabling fine-grained compositional control over generated images.
GIF: Generative Inspiration for Face Recognition at Scale: Proposes replacing scalar labels in face recognition with structured identity codes (integer sequences). Code vectors are generated via CLIP initialization and hypersphere homogenization, followed by hierarchical clustering to build tree-structured codes. This reduces classifier computational complexity from $\mathcal{O}(m)$ to $\mathcal{O}(\log m)$ while addressing the minority collapse problem.
GLASS: Guided Latent Slot Diffusion for Object-Centric Learning: This paper proposes GLASS, an object-centric learning method based on Slot Attention. By learning within the image space generated by a diffusion model, GLASS collaboratively addresses over-segmentation and under-segmentation issues using a semantic guidance module (generating pseudo-semantic masks from the cross-attention of the diffusion model) and an instance guidance module (reconstructing encoder features via an MLP). It significantly outperforms prior methods on object discovery, conditional generation, and compositional generation in real-world scenarios.
GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing: Proposes GlyphMastero, a glyph encoder that leverages dual-stream (local character-level + global text line-level) feature extraction, cross-level attention interaction, and multi-scale FPN fusion to provide stroke-level precise glyph guidance for diffusion models, improving sentence accuracy by 18.02% and reducing FID by 53.28% in multilingual scene text editing.
Goku: Flow Based Video Generative Foundation Models: Goku is a family of rectified flow Transformer models (2B/8B) proposed by ByteDance and HKU, marking the first application of rectified flow to joint image-video generation. Assisted by a comprehensive data curation pipeline and large-scale training infrastructure optimization, Goku achieves state-of-the-art (SOTA) performance on benchmarks such as VBench (84.85) and GenEval (0.76).
GPS as a Control Signal for Image Generation: This work uses GPS coordinates from photo EXIF metadata as a new control signal for diffusion models and trains an image generation model conditioned jointly on GPS and text. The model captures fine-grained architectural and appearance variations across different neighborhoods or landmarks within a city, and 3D landmark reconstructions can be extracted from the 2D model via angle-conditioned SDS.
GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing: This paper introduces GRADE, the first benchmark designed to evaluate discipline-informed reasoning in image editing. Spanning 520 samples across 10 academic disciplines, it establishes a multidimensional evaluation protocol that reveals significant deficiencies in 20 state-of-the-art multimodal models on knowledge-intensive editing tasks.
GraphGPT-o: Synergistic Multimodal Comprehension and Generation on Graphs: This paper proposes GraphGPT-o, which injects structural information of multimodal attributed graphs (MMAGs, where nodes contain image+text and edges represent relations) into a Multimodal Large Language Model (MLLM). Through PPR sampling, a hierarchical Q-Former aligner, and flexible inference strategies, it achieves joint text-image generation conditioned on the graph context.
h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform: h-Edit formalizes diffusion-based image editing as a backward-time bridge modeling problem based on Doob's h-transform. By decoupling editing updates into a "reconstruction term" and an "editing term," it achieves training-free joint editing guided by text and reward models for the first time, comprehensively outperforming existing SOTA methods on PIE-Bench.
Hiding Images in Diffusion Models by Editing Learned Score Functions: A method is proposed to hide images by editing the learned score function at a specific timestep of a diffusion model. Combined with gradient-aware parameter selection and LoRA to achieve parameter-efficient fine-tuning, the proposed method outperforms existing baselines by several orders of magnitude across three dimensions: extraction accuracy (52.90 dB PSNR), model fidelity (FID change of only 0.02), and hiding efficiency (0.04 GPU hours).
Hierarchical Flow Diffusion for Efficient Frame Interpolation: This paper proposes to explicitly denoise bidirectional optical flow in a coarse-to-fine manner using a hierarchical diffusion model (instead of directly denoising the latent space) for video frame interpolation, followed by a flow-guided image synthesizer to generate the final frame. This achieves SOTA accuracy while being more than 10× faster than other diffusion-based methods.
HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation: HMAR reformulates the next-scale prediction of VAR into a Markov process (relying only on the cumulative reconstruction of the previous scale rather than all prior scales). It introduces multi-step masked generation within each scale to eliminate the conditional independence assumption. Coupled with a customized IO-aware block-sparse attention kernel, HMAR matches or exceeds VAR/DiT quality on ImageNet while achieving 2.5× training acceleration and a 3× reduction in inference memory.
HSI: A Holistic Style Injector for Arbitrary Style Transfer: HSI proposes a style transfer module based on global style statistical features and element-wise multiplication, replacing the quadratic complexity of self-attention with linear complexity. Simultaneously, it enhances stylization quality through a dual-relation learning mechanism, outperforming existing methods in both effectiveness and efficiency.
ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models: A two-stage framework, ICE, is proposed to automatically localize object-level concepts from a single image and decompose them into intrinsic properties (category, color, texture) using a single T2I diffusion model, achieving label-free and model-free hierarchical visual concept extraction.
IDEA-Bench: How Far are Generative Models from Professional Designing?: Proposes IDEA-Bench, the first comprehensive benchmark for professional-grade image design, covering 100 real-world design tasks (posters, picture books, typography, visual effects, etc.) and 5 input/output modes. It reveals that the strongest current model scores only 22.48/100, leaving a large gap to professional-level design.
IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation: IDProtector proposes the first feed-forward adversarial noise encoder that adds imperceptible adversarial perturbations to face photos via a single forward pass. It can simultaneously defend against multiple encoder-driven identity-preserving generation methods such as InstantID, IP-Adapter, and PhotoMaker, while remaining robust to transformations like JPEG compression and resizing.
ILIAS: Instance-Level Image Retrieval At Scale: ILIAS is a large-scale instance-level image retrieval benchmark consisting of 1,000 instance objects and 100 million distractor images. Through comprehensive benchmarking, it reveals the capabilities and limitations of current foundation models in specific object recognition, providing a far-from-saturated evaluation standard for this field.
Image Generation Diversity Issues and How to Tame Them: This work reveals that current diffusion models suffer from a severe lack of diversity (with state-of-the-art models covering only 77% of training data diversity). It proposes the Image Retrieval Score (IRS), a metric based on image retrieval that provides interpretable diversity measurement, and introduces Diversity-Aware Diffusion Models (DiADM) to enhance diversity without sacrificing generation quality.
Image Referenced Sketch Colorization Based on Animation Creation Workflow: Mimicking the actual animation production workflow, this paper proposes an image-referenced sketch colorization framework based on diffusion models. By introducing Split Cross-Attention coupled with a switchable LoRA mechanism, it processes foreground and background colorization separately, successfully eliminating spatial entanglement artifacts. After training on 4.8M images, it outperforms existing methods in qualitative, quantitative, and user study evaluations.
Implicit Bias Injection Attacks against Text-to-Image Diffusion Models: This paper proposes the Implicit Bias Injection Attack framework (IBI-Attacks). By pre-computing a universal bias direction vector in the text embedding space and dynamically adjusting it according to different user inputs using an adaptive feature selection module, this approach implants implicit biases (e.g., emotions, cultural inclinations) into pre-trained text-to-image diffusion models in a plug-and-play manner. This framework preserves the original semantics of the generated content and achieves over an 80% attack success rate, while only 35.8% of the attacks are perceived by human evaluators.
Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing: This paper proposes Decoupled Annealed Posterior Sampling (DAPS). By decoupling the sample dependencies between adjacent steps during the diffusion sampling process, it allows large-scale non-local jumps to correct early sampling errors, substantially outperforming existing methods on non-linear inverse problems (e.g., phase retrieval).
Improving Editability in Image Generation with Layer-wise Memory: This paper proposes an iterative image editing framework based on layer-wise memory. By storing the latent and prompt embedding of each editing step and combining Background Consistency Guidance (BCG) with Multi-Query Decoupled Attention (MQD), the framework ensures consistent backgrounds and natural integration of new objects during multi-step sequential editing.
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment: This paper proposes DDIM-InPO, which treats the diffusion model as a single-step generative model and utilizes DDIM inversion to identify latent variables highly correlated with preference data, achieving state-of-the-art (SOTA) efficient diffusion preference alignment in only 400 fine-tuning steps.
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment: Proposes InPO (Inversion Preference Optimization), which simplifies preference optimization from a long Markovian process requiring a full denoising chain to a single-step optimization using a reparameterized DDIM inversion technique, outperforming existing Diffusion-DPO methods in both training efficiency and generation quality.
InsightEdit: Towards Better Instruction Following for Image Editing: This work proposes InsightEdit, constructs a high-quality editing dataset with 2.5 million pairs named AdvancedEdit, and designs a two-stream bridging mechanism to inject both the textual reasoning features and visual semantic features of an MLLM into a diffusion model, achieving SOTA performance in complex instruction following and background consistency.
Instant Adversarial Purification with Adversarial Consistency Distillation: This work proposes the One Step Control Purification (OSCP) framework, which integrates Gaussian Adversarial Noise Distillation (GAND) and Controlled Adversarial Purification (CAP) to achieve adversarial purification within a single U-Net inference step (~0.1 seconds), yielding a 100x acceleration compared to traditional diffusion-based purification methods.
InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation: This paper presents the InterAct benchmark, which consolidates and standardizes 21.81 hours of 3D human-object interaction data (expanded to 30.70 hours). Through a unified optimization framework, it corrects motion capture artifacts and augments the data, defining six generation tasks and a unified modeling approach to achieve SOTA performance across multiple HOI generation tasks.
InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing: This paper proposes InterEdit, the first text-guided multi-human 3D motion interaction editing framework. It achieves semantic editing while preserving the spatio-temporal coupling relationships between multiple humans in diffusion models through Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment.
InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions: InterMimic proposes a curriculum-driven teacher-student distillation framework that, for the first time, lets a single policy learn diverse whole-body physical human-object interaction skills from large-scale imperfect MoCap data. It first "perfects" each motion subset through teacher policies, then distills them into a student policy, and leverages RL fine-tuning to go beyond simple imitation, ultimately supporting zero-shot generalization and seamless integration with motion generators.
Interpretable Generative Models through Post-hoc Concept Bottlenecks: This paper proposes two low-cost post-hoc methods—Concept Bottleneck Autoencoders (CB-AE) and Concept Controllers (CC)—to convert pre-trained generative models into interpretable and controllable models without retraining from scratch or requiring ground-truth annotations. They outperform prior CBGM methods in steerability by an average of approximately 25% on CelebA/CelebA-HQ/CUB datasets, while being 4–15x faster to train.
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation: This paper proposes JanusFlow, which directly integrates rectified flow into the autoregressive LLM framework. By decoupling understanding/generation encoders and utilizing representation alignment regularization, it achieves state-of-the-art performance in both multimodal understanding and image generation with only 1.3B parameters.
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs: This paper proposes K-LoRA, which compares the importance of subject and style LoRAs by accumulating the absolute values of Top-K elements in each attention layer, adaptively selects the entire layer's LoRA weights, and integrates a timestep scaling factor to achieve training-free, high-quality subject-style fusion.
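The per-layer selection rule described for K-LoRA can be illustrated with a minimal sketch. It assumes only what the summary states (sum the absolute values of the Top-K elements of each LoRA's weight update, then keep the whole layer from the higher-scoring LoRA). The function names, the toy matrices, and the tie-breaking toward the subject LoRA are hypothetical, and the timestep scaling factor is omitted.

```python
def top_k_abs_sum(matrix, k):
    """Sum of the k largest absolute values in a 2D weight update."""
    flat = sorted((abs(v) for row in matrix for v in row), reverse=True)
    return sum(flat[:k])


def select_layer_lora(delta_subject, delta_style, k):
    """Keep one LoRA update for the whole layer: the one with the larger top-k score.

    Ties go to the subject LoRA (an arbitrary choice made for this sketch).
    """
    s_subject = top_k_abs_sum(delta_subject, k)
    s_style = top_k_abs_sum(delta_style, k)
    if s_subject >= s_style:
        return "subject", delta_subject
    return "style", delta_style


# Toy 2x3 weight updates for a single attention layer (made-up numbers).
delta_subject = [[0.9, -0.1, 0.0], [0.2, -0.8, 0.1]]
delta_style = [[0.3, 0.3, -0.3], [0.3, -0.3, 0.3]]

choice, chosen_delta = select_layer_lora(delta_subject, delta_style, k=2)
print(choice)
```

Because the decision is made once per layer, the selected update is applied unchanged; no element-wise mixing of the two LoRAs takes place in this sketch.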
Language-Guided Image Tokenization for Generation: TexTok proposes incorporating textual descriptions as conditions during the image tokenization stage, offloading high-level semantic information to text. This allows image tokens to focus on encoding fine-grained visual details, thereby achieving higher compression rates while maintaining or even improving reconstruction quality, leading to a state-of-the-art (SOTA) generation FID score of 1.46 on ImageNet.
Latent Space Imaging: Latent Space Imaging (LSI) proposes a new imaging paradigm that integrates optical encoding with generative model decoding. By directly encoding image information into the semantic latent space of StyleGAN, it achieves extreme compression ratios from 1:100 to 1:16384, while still enabling downstream tasks such as face reconstruction, attribute classification, segmentation, and landmark detection.
LaTexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending: LaTexBlend achieves high-fidelity, high-efficiency multi-concept customized image generation by representing and blending multiple customized concepts in the latent textual space following the text encoder. It scales linearly in fine-tuning complexity with zero additional inference overhead.
LaVin-DiT: Large Vision Diffusion Transformer: LaVin-DiT proposes a large vision foundation model based on the Diffusion Transformer. Through spatio-temporal VAE encoding, joint diffusion Transformer denoising, and in-context learning, it achieves unified processing of over 20 vision tasks. Scaling from 0.1B to 3.4B parameters, it significantly outperforms the autoregressive large vision model LVM on multiple tasks.
Learning Flow Fields in Attention for Controllable Person Image Generation: Proposes Leffa (Learning Flow Fields in Attention), which converts attention maps into flow fields within the attention layers of diffusion models and performs pixel-level regularization supervision. This explicitly guides the target query to attend to the correct reference key regions, successfully reducing fine-grained detail distortions (textures, text, logos) with zero additional inference overhead. It achieves state-of-the-art (SOTA) performance in both virtual try-on (VITON-HD, DressCode) and pose transfer (DeepFashion).
Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation: Proposes PAG (Prompt Adaptation with GFlowNets), which reformulates prompt adaptation as a probabilistic inference problem. By using GFlowNets to sample from the reward distribution instead of maximizing the reward, and combining three key techniques—flow reactivation, reward-prioritized sampling, and progressive reward decomposition—to solve the mode collapse issue, PAG generates text-to-image prompts that are both high in quality and diverse.
Learning Visual Generative Priors without Text: Proposes the Lumos framework, which learns visual generative priors through purely visual image-to-image (I2I) self-supervised pre-training, then matches or even surpasses existing T2I models with only 1/10 of the text-image pairs for fine-tuning. It also demonstrates superior performance over T2I priors on text-free visual tasks (I2V, NVS).
LEDiff: Latent Exposure Diffusion for HDR Generation: LEDiff is proposed to enable HDR generation in existing generative models and achieve SOTA LDR-to-HDR translation. This is achieved by performing exposure fusion in the latent space of a pre-trained diffusion model (rather than the image space) and fine-tuning the VAE decoder and denoiser with a small amount of HDR data.
Lifting Motion to the 3D World via 2D Diffusion: MVLift proposes a multi-stage framework trained solely on single-view 2D pose sequences. It establishes multi-view consistency through a progressive strategy (line-conditioned diffusion model $\rightarrow$ multi-view optimization $\rightarrow$ synthetic data generation $\rightarrow$ multi-view diffusion model) to achieve global 3D motion (including joint rotation and root trajectory) estimation without 3D supervision. It outperforms the 3D-supervised WHAM (164.3mm) with a root trajectory error of 67.6mm on AIST++.
LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping: This paper proposes LookingGlass, a method that leverages Laplacian Pyramid Warping (LPW) to extend the Visual Anagrams framework to latent-space rectified flow models and a broader range of spatial transformations. This allows the generation of anamorphosis images that are perceptually meaningful from both a normal perspective and specific catoptric/dioptric viewing setups.
LoRACLR: Contrastive Adaptation for Customization of Diffusion Models: LoRACLR proposes a LoRA model merging method based on a contrastive learning objective. By learning a delta weight, it fuses multiple independently trained single-concept LoRA models into a unified model without retraining or accessing the original training data. This achieves high-fidelity multi-concept image generation, requiring only 5 minutes to merge 12 concepts.
lbGen: Low-Biased General Annotated Dataset Generation: The lbGen framework is proposed to fine-tune Stable Diffusion through bi-level semantic alignment (global adversarial + individual cosine similarity) and quality assurance losses. Using only category names, it generates a low-biased general annotated dataset. Backbones pretrained on this dataset outperform those trained on real ImageNet data by 1.7–2.1% in average transfer accuracy.
LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting: Proposes LumiNet, which injects the latent intrinsic features (a 128-dimensional albedo-like representation) of the source image and the latent extrinsic lighting code (16-dimensional) of the target image into a modified ControlNet. This enables image-only indoor scene-level light transfer, capturing complex effects such as specular highlights, shadows, and indirect illumination.
MagicQuill: An Intelligent Interactive Image Editing System: MagicQuill is proposed as an intelligent interactive image editing system that expresses editing intentions using three types of brushstrokes (add/subtract/color). A dual-branch diffusion plugin (inpainting + control) achieves precise control over edges and colors, while an MLLM guesses intentions in real time to automatically generate prompts, enabling a continuous editing workflow without manual text input.
Make It Count: Text-to-Image Generation with an Accurate Number of Objects: This paper proposes CountGen, an approach that isolates and counts object instances by identifying features carrying object identity information in the denoising process of diffusion models, and trains a layout prediction model to mitigate under-generation, enabling accurate object-counting text-to-image generation without relying on external layout generators.
MangaNinja: Line Art Colorization with Precise Reference Following: MangaNinja is a diffusion-based reference-guided line art colorization method. By training the model to learn local semantic matching capabilities via a progressive patch shuffling strategy and introducing a PointNet-driven point control mechanism for precise color correspondence, it significantly outperforms existing methods in challenging scenarios such as large pose variations, multi-reference inputs, and cross-character colorization.
MARBLE: Material Recomposition and Blending in CLIP-Space: Operates on material embeddings solely in CLIP space, achieving material transfer and blending via targeted injection into the material-responsive layers of the UNet. A lightweight MLP predicts attribute editing directions, giving parametric control of roughness, metallic, transparency, and glow without fine-tuning the diffusion model.
MCA-Ctrl: Multi-party Collaborative Attention Control for Image Customization: This paper proposes MCA-Ctrl, a tuning-free image customization method. By utilizing Self-Attention Global Injection (SAGI) and Self-Attention Local Querying (SALQ) operations within the self-attention layers of three parallel diffusion processes, it simultaneously supports high-quality subject generation, replacement, and addition under both text and image conditions.
MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation: MCCD proposes a compositional diffusion method based on multi-agent collaboration. It utilizes an MLLM-driven multi-agent system to parse complex scenes, and achieves accurate and high-fidelity generation of multi-object complex scenes through hierarchical compositional diffusion (Gaussian masks and regional enhancement) without requiring training.
Memories of Forgotten Concepts: This paper reveals a fundamental flaw in concept ablation methods for diffusion models: by finding highly likely latent seeds through diffusion inversion, it demonstrates that information about erased concepts still resides within the model, allowing high-quality images of the ablated concepts to be reconstructed from multiple distinct seed vectors.
MetaShadow: Object-Centered Shadow Detection, Removal, and Synthesis: MetaShadow proposes the first three-in-one framework that synergistically combines a GAN-based Shadow Analyzer (shadow detection + removal) with a diffusion-based Shadow Synthesizer (shadow synthesis). It transfers shadow knowledge by guiding the diffusion model with intermediate GAN features, achieving SOTA performance across all three shadow tasks.
MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification: MExD is the first to apply generative diffusion models to whole-slide image (WSI) classification. By utilizing a Dynamic Mixture-of-Experts (Dyn-MoE) aggregator to select key instances and provide conditional information, combined with a Diffusion Classifier (Diff-C) to iteratively reconstruct class labels from noise, it achieves state-of-the-art (SOTA) performance on three benchmarks: Camelyon16, TCGA-NSCLC, and BRACS.
MINIMA: Modality Invariant Image Matching: MINIMA proposes a unified cross-modal image matching framework. By designing a data engine to generate a multimodal synthetic dataset MD-syn (480M pairs) from cheap RGB image pairs, any existing matching pipeline can obtain cross-modal matching capability through simple fine-tuning, significantly outperforming modality-specific methods across 19 cross-modal scenarios.
Minority-Focused Text-to-Image Generation via Prompt Optimization: MinorityPrompt proposes an online prompt optimization framework. By iteratively optimizing a learnable token embedding during the inference stage to maximize a likelihood-related loss, it guides T2I diffusion models to generate minority samples located in the low-density regions of the data distribution, while maintaining semantic consistency and generation quality.
MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World: MirrorVerse constructs an enhanced synthetic dataset SynMirrorV2 (featuring random poses, rotations, and multi-object scenes) and leverages a three-stage curriculum training strategy to train MirrorFusion 2.0. This allows diffusion models to generate realistic mirror reflections for the first time, significantly outperforming prior methods in both synthetic and real-world scenarios.
Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection: Proposes RAPTA (training-time region-aware prompt variant augmentation based on object detection) and ADMCD (inference-time training-free multimodal copy detection with three-stream attention fusion) to address the training data memorization issue in text-to-image diffusion models in an end-to-end manner from both mitigation and detection perspectives.
MixerMDM: Learnable Composition of Human Motion Diffusion Models: MixerMDM is proposed as the first learnable composition technique for human motion diffusion models. It uses a Transformer-based Mixer module to predict dynamic mixing weights, learning to blend individual and interactive motion diffusion models via adversarial training to achieve fine-grained, controllable human-human interaction motion generation.
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling: This work integrates continuous image representations and discrete text representations into a unified autoregressive probabilistic modeling framework for the first time. It avoids information loss by replacing VQ discretization with a lightweight diffusion head, and derives v-prediction as the optimal parameterization to address numerical error issues in low-precision training.
MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices: Proposes MobilePortrait, the first one-shot neural head avatar animation method capable of running in real time on mobile devices. By combining mixed explicit-implicit keypoints with precomputable appearance knowledge, it uses only 16 GFLOPs while matching the quality of SOTA methods that require 100–600+ GFLOPs.
Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification: This paper proposes the Human Annotator Modeling (HAM) method. By extracting and clustering style features from human-annotated descriptions, it utilizes learnable prompts to enable MLLMs to simulate thousands of human annotation styles. Combined with Uniform Prototype Sampling (UPS) to further enhance style diversity, HAM automatically constructs a large-scale, high-quality text-image person ReID dataset, significantly improving the generalization ability of ReID models across multiple benchmarks.
Move-in-2D: 2D-Conditioned Human Motion Generation: Defines a new task of human motion generation conditioned on 2D scene images and text, constructs the 300K-scale HiC-Motion dataset, and generates motion sequences that naturally project onto the scene using an in-context conditioning diffusion Transformer, enabling downstream human video generation.
MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting: MTADiffusion addresses three key challenges in object inpainting simultaneously—semantic misalignment, structural distortion, and style inconsistency—by constructing a mask-text aligned dataset with 5 million images, jointly training the inpainting and edge prediction tasks, and employing a VGG Gram matrix-based style consistency loss. It achieves SOTA performance on BrushBench and EditBench.
Multi-focal Conditioned Latent Diffusion for Person Image Synthesis: MCLD decouples the source person image into three focal conditions: face region, appearance texture, and overall image. It designs a Multi-Focal Condition Aggregation (MFCA) module to selectively inject different conditions at different stages of the UNet, effectively mitigating the face and texture detail degradation caused by LDM compression and achieving SOTA results on DeepFashion.
Multi-Group Proportional Representation for Text-to-Image Models: This paper proposes the Multi-Group Proportional Representation (MPR) metric to systematically measure representational bias across intersectional demographic groups in text-to-image models, and develops an optimization algorithm based on this metric to guide T2I models toward more balanced group representation while preserving generation quality.
Multi-party Collaborative Attention Control for Image Customization: Proposes MCA-Ctrl, a tuning-free image customization method that achieves high-quality text/image-conditioned subject editing and generation through collaborative self-attention control (SAGI + SALQ) over a three-way parallel diffusion process. A Subject Localization Module is also introduced to address subject leakage and confusion in complex scenarios.
Multitwine: Multi-Object Compositing with Text and Layout Control: This paper proposes Multitwine, the first generative model supporting simultaneous multi-object compositing guided by text and layouts. By jointly training the compositing and personalized generation tasks, and incorporating cross-attention/self-attention decoupling losses, it achieves natural interactions (e.g., hugging, playing guitar) when inserting multiple objects simultaneously; the user study indicates a preference rate of up to 97.1% for interaction realism.
MVPortrait: Text-Guided Motion and Emotion Control for Multi-View Vivid Portrait Animation: This paper proposes MVPortrait, a two-stage text-guided framework (Text2FLAME + FLAME2Video). By using the FLAME 3D parametric face model as an intermediate representation, it utilizes MotionDM and EmotionDM diffusion models to generate motion and expression parameter sequences, respectively. Subsequently, a multi-view video generation model is employed to transform the rendered FLAME sequences into realistic multi-view portrait animations. This represents the first approach to achieve controllable portrait animation compatible with text, audio, and video modalities simultaneously.
Navigating Image Restoration with VAR's Distribution Alignment Prior: This paper discovers that the next-scale prediction of the Visual AutoRegressive (VAR) model possesses an inherent multi-scale distribution alignment capability—low-scale restores global degradations (e.g., low-light, haze), while high-scale restores local degradations (e.g., noise, rain streaks). Based on this, the VarFormer framework is constructed, which adaptively selects scale priors via Degradation-Aware Enhancement (DAE) and fuses prior and degraded features via Adaptive Feature Transformation (AFT), outperforming existing multi-task methods across 6 restoration tasks.
Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models: This paper proposes FastProtect, the first latency-focused image protection framework. By replacing traditional image-by-image iterative optimization with pre-trained Mixture-of-Perturbations (MoP), combined with Multi-Layer Protection Loss to enhance training effects, as well as Adaptive Targeted Protection and Adaptive Protection Strength to optimize inference, FastProtect achieves real-time protection that is 175× faster than the existing fastest method PhotoGuard (0.04s vs 7s for processing a $512^2$ image on an A100 GPU), while maintaining comparable protection efficacy and superior invisibility.
Nested Diffusion Models Using Hierarchical Latent Priors: This paper proposes Nested Diffusion Models, which sequentially generate latent variables at different semantic levels using a series of coarse-to-fine diffusion models, conditioning each stage on the outputs of coarser stages. On ImageNet 256×256, with only a 25% increase in computational cost, it reduces the unconditional FID from 45.19 to 11.05 and the conditional FID to 3.97.
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis: Noise Diffusion proposes leveraging VQA score supervision from Large Vision-Language Models (VLMs) to optimize the initial noise of diffusion models. By utilizing a distribution-preserving noise update formula $z'_T = \sqrt{1-\gamma} z_T + \sqrt{\gamma} \sigma$ (guaranteeing $z'_T \sim \mathcal{N}(0,I)$) and gradient-guided noise selection, it improves the VQA Score by 19.3% on complex prompts and is compatible with all SD versions and various VLMs.
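The distribution-preserving property of the update $z'_T = \sqrt{1-\gamma} z_T + \sqrt{\gamma} \sigma$ can be checked numerically with a small sketch. It assumes that the term written $\sigma$ is fresh standard Gaussian noise drawn independently of $z_T$, so the variance is $(1-\gamma) \cdot 1 + \gamma \cdot 1 = 1$. The function names and the choice $\gamma = 0.3$ are illustrative, and the gradient-guided noise selection step is not modeled.

```python
import math
import random


def update_noise(z_t, gamma, rng):
    """Mix the current noise with fresh N(0, 1) noise, element by element.

    Computes sqrt(1 - gamma) * z + sqrt(gamma) * eps with eps ~ N(0, 1).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    a = math.sqrt(1.0 - gamma)
    b = math.sqrt(gamma)
    return [a * z + b * rng.gauss(0.0, 1.0) for z in z_t]


def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
z_T = [rng.gauss(0.0, 1.0) for _ in range(50_000)]  # flattened stand-in for a latent
z_new = update_noise(z_T, gamma=0.3, rng=rng)

# Both empirical variances should sit close to 1.
print(round(variance(z_T), 2), round(variance(z_new), 2))
```

A naive additive update such as `z + step * eps` would inflate the variance to `1 + step**2`; the square-root weights are what keep the updated noise on the standard normal distribution.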
Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction: SkeletonDiffusion proposes a nonisotropic Gaussian diffusion model for 3D human motion prediction. It constructs a non-diagonal covariance matrix $\Sigma_N$ using the skeletal adjacency matrix (instead of the standard $I$), ensuring that the diffusion noise naturally conforms to the human skeletal topology. This reduces limb jitter from 0.52 to 0.26 and bone stretching from 5.54 to 4.45.
Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability: MaskUNet discovers the counter-intuitive phenomenon in diffusion models that "setting certain U-Net parameters to zero actually enhances generation quality," and proposes a learnable binary mask based on timesteps and sample content to dynamically select parameters. This reduces COCO 2014 FID from 12.85 to 11.72 (an 8.8% relative improvement) and improves T2I-CompBench color binding from 0.375 to 0.699.
Not Just Text: Uncovering Vision Modality Typographic Threats in Image Generation Models: This paper exposes a "typographic attack" vulnerability in the vision modality of image generation models—where attackers can manipulate generation results by embedding text into input images. It systematically evaluates the ineffectiveness of existing defenses against such vision modality threats and proposes the VMT-IGMs dataset as an evaluation benchmark.
ObjectMover: Generative Object Movement with Video Prior: ObjectMover models the task of object movement in images as a sequence-to-sequence problem. By fine-tuning a video generation model, it leverages cross-frame object consistency priors. Combined with high-quality synthetic data pairs generated by a game engine and a multi-task learning strategy, it achieves realistic relighting, occlusion completion, and synchronized shadow/reflection editing in complex real-world scenes.
OFER: Occluded Face Expression Reconstruction: OFER utilizes two conditional diffusion models to generate the shape and expression coefficients of the FLAME parametric model, respectively, and integrates a ranking network to select the optimal shape from multiple candidates, achieving diverse and realistic 3D facial expression reconstruction under occlusion.
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows: OmniFlow extends the rectified flow framework of Stable Diffusion 3 to joint multimodal (text + image + audio) generation scenarios. Through a modular Omni-Transformer architecture and a novel multimodal guidance mechanism, it achieves superior generation quality compared to previous any-to-any models like CoDi and UniDiffuser without requiring training from scratch.
OmniGen: Unified Image Generation: Proposes OmniGen, the first general-purpose image generation foundation model, composed solely of a VAE and a Transformer. It handles multiple tasks end to end, including text-to-image, image editing, and controllable generation, from arbitrarily interleaved multimodal inputs.
OmniStyle: Filtering High Quality Style Transfer Data at Scale: Built the first million-scale paired style transfer dataset OmniStyle-1M (1 million content-style-stylized triplets, 1000 styles), designed the OmniFilter multidimensional quality filtering framework to screen high-quality data, and trained an end-to-end style transfer model OmniStyle based on the DiT architecture. It supports both instruction-guided and reference-guided style transfer, comprehensively outperforming existing methods.
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers: Reveals that DiT computation is uniformly distributed across spatial tokens (failing to reallocate redundant computation to difficult regions), and proposes ELIT. ELIT inserts a variable-length latent interface (Read/Write cross-attention) in DiT, randomly drops tail latents during training to learn an importance ordering, and adjusts the number of latents during inference to achieve a smooth quality-FLOPs trade-off, reducing FID by 53% on ImageNet 512px.
OpenSDI: Spotting Diffusion-Generated Images in the Open World: OpenSDI defines the open-world diffusion image detection challenge, constructs a large-scale dataset OpenSDID containing multi-VLM-generated instructions and various diffusion models, and proposes MaskCLIP—which synergizes CLIP and MAE through a Synergizing Pretrained Models (SPM) framework, significantly outperforming existing methods on both detection and localization tasks.
Optimizing for the Shortest Path in Denoising Diffusion Model: Models the denoising process of diffusion models as a shortest path problem in graph theory. By optimizing the initial residual to compress the reverse diffusion path, it achieves a generation quality with 2-step sampling that matches or even exceeds that of 10-step DDIM.
ORIDa: Object-Centric Real-World Image Composition Dataset: ORIDa constructs the first large-scale, real-shot, and publicly available object composition dataset containing over 30,000 images of 200 unique objects (including fact-counterfactual pairs and multi-position variations). It validates the dataset's efficacy on object removal and insertion tasks via fine-tuning on StableDiffusion-Inpaint.
OSDFace: One-Step Diffusion Model for Face Restoration: OSDFace proposes the first one-step diffusion model specifically designed for face restoration. By extracting rich prior information from low-quality faces using a Visual Representation Embedder (VRE), and combining this with facial identity loss and GAN guidance, it generates high-fidelity, natural, and identity-consistent face images in just a single step of inference (~0.1 second), comprehensively outperforming existing SOTA methods.
Panorama Generation From NFoV Image Done Right: Discovered the "visual cheating" phenomenon in existing panorama generation methods (sacrificing distortion accuracy for visual quality). Proposed PanoDecouple, a decoupled framework that divides panorama generation into distortion guidance (DistortNet) and content completion (ContentNet), achieving optimal performance in both distortion and visual quality with only 3K training samples.
Parallel Sequence Modeling via Generalized Spatial Propagation Network: GSPN proposes the Generalized Spatial Propagation Network, which achieves a natively 2D spatially-aware sub-quadratic attention mechanism through 2D linear propagation of row/column line scans under a stability-context condition. This reduces the effective sequence length to $\sqrt{N}$ and accelerates SD-XL by up to 84x in 16K image generation.
PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation: This paper proposes PatchDPO, which replaces the image-level preference evaluation of traditional DPO with patch-level quality estimation to optimize pretrained personalized generation models in a second-stage training. It achieves SOTA performance in both single-subject and multi-subject generation on the DreamBooth and Concept101 datasets.
Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy: Pattern Analogies proposes a framework for structured editing of pattern images without inferring the underlying program: users demonstrate the desired editing operation through a pair of simple patterns $(A, A')$, and the TriFuser diffusion model transfers this editing to a complex target pattern $B$ to generate $B'$, faithfully executing and generalizing to unseen pattern styles designed by real-world artists.
PCM: Picard Consistency Model for Fast Parallel Sampling of Diffusion Models: PCM proposes the Picard Consistency Model to accelerate the parallel sampling of diffusion models via Picard iteration. By training a model to directly predict the fixed-point solution and introducing a model switching mechanism to ensure exact convergence, it achieves up to a 2.71x speedup in image generation and robot control tasks.
Personalized Preference Fine-tuning of Diffusion Models: PPD proposes a framework for fine-tuning diffusion models on personalized preferences: it leverages a VLM to extract user embeddings from a few preference examples, injects them into the diffusion model via decoupled cross-attention layers, and optimizes the preferences of multiple users simultaneously using a DPO objective. It requires only 4 preference pairs to generate preference-aligned images for a new user (76% win rate).
PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?: This work introduces the PhysicsGen benchmark, which contains 300,000 image pairs covering three physical simulation tasks (acoustic wave propagation, lens distortion, and rolling/bouncing dynamics). It systematically evaluates the ability of generative models to learn physical relations, revealing that physical relations described by high-order differential equations present a fundamental challenge to current models.
PICD: Versatile Perceptual Image Compression with Diffusion Rendering: PICD proposes a versatile perceptual image compression framework. By losslessly encoding text information and "rendering" it with the compressed image using a diffusion model, the method improves the conditional diffusion model across three levels (domain level, adapter level, and instance level), simultaneously achieving high visual quality and high text accuracy for both screen content and natural images.
Pippo: High-Resolution Multi-View Humans from a Single Image: Pippo proposes a multi-view Diffusion Transformer that generates 1K-resolution human turnaround videos from a single captured snapshot. Through a three-stage training strategy (pre-training on 3 billion human images + mid-training + post-training) and an inference-time attention bias technique, it achieves the capability to generate over 5 times the number of training views.
PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction: This work proposes PQPP, the first joint benchmark for text-to-image prompt and query performance prediction, containing over 10K queries and 1.6 million human annotations. It finds that query difficulty in generation and retrieval is almost uncorrelated (Pearson correlation of only 0.135).
Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters: This paper proposes AdaVD (Adaptive Value Decomposer), a training-free concept erasure method for T2I diffusion models. By projecting the original prompt onto the orthogonal complement space of the target concept within the value space of the cross-attention, and introducing an adaptive shift factor, it achieves precise erasure of the target concept while minimally affecting non-target content.
PreciseCam: Precise Camera Control for Text-to-Image Generation: PreciseCam achieves precise camera perspective control in text-to-image generation through 4 camera parameters (roll, pitch, vFoV, distortion $\xi$) and Perspective Field-Unified Spherical representation, without requiring 3D geometry or multi-view data.
Probability Density Geodesics in Image Diffusion Latent Space: This paper demonstrates that probability-density-based geodesics can be computed in the latent space of diffusion models, where paths traversing high-probability density regions are "shorter" than those through low-density regions. It also showcases the application of this technique in video approximation analysis, training-free image sequence interpolation, and extrapolation.
ProReflow: Progressive Reflow with Decomposed Velocity: This paper proposes progressive Reflow (gradually straightening diffusion trajectories from multi-window to few-window) and aligned v-prediction (prioritizing direction over magnitude in velocity matching), enabling SDv1.5 to achieve generation quality close to 32-step DDIM with only 4-step sampling.
Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction: This paper proposes DPIDM (Dynamic Pose Interaction Diffusion Models), which injects synchronized human and garment skeleton poses into the denoising network via a pose adapter. A hierarchical attention module is designed to model intra-frame human-garment pose spatial interactions and inter-frame human pose temporal dynamics. Combined with a temporal regulative attention loss to enhance temporal consistency, the method achieves a VFID of 0.506 on the VVT dataset, representing a 60.5% improvement over the SOTA.
RAD: Region-Aware Diffusion Models for Image Inpainting: RAD achieves region-asynchronous generation by assigning distinct noise schedules to different pixels. With minimal structural modifications to the vanilla diffusion model (replacing FC layers with $1\times 1$ convolutions), it achieves state-of-the-art (SOTA) inpainting quality while accelerating inference speed by 100x.
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders: This paper proposes RandAR—the first decoder-only visual autoregressive model that supports arbitrary token generation orders. By inserting a "position instruction token" before each image token to specify the spatial position of the next token to be generated, it unlocks novel capabilities including parallel decoding (2.5x speedup), zero-shot inpainting/outpainting, and resolution extrapolation, without sacrificing performance.
Random Conditioning for Diffusion Model Compression with Distillation: This paper proposes Random Conditioning, a technique that pairs noisy images with randomly selected, unrelated text conditions during the knowledge distillation of conditional diffusion models. This allows the student model to explore the full condition space without needing to generate corresponding images for each text, achieving highly efficient image-free or image-scarce diffusion model compression while enabling the student to generate concepts unseen during training.
RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories: The RayFlow diffusion framework is proposed, which designs a unique diffusion trajectory (pointing to an instance-specific target distribution) for each sample and optimizes training via Time Sampler importance sampling, maintaining generation diversity and stability while minimizing sampling steps.
Re-HOLD: Video Hand Object Interaction Reenactment via Adaptive Layout-instructed Diffusion Model: This work proposes Re-HOLD, the first human-centric hand-object interaction (HOI) video reenactment framework, which decouples hand and object modeling through a decoupled layout representation, and combines an interactive texture enhancement module with an adaptive layout adjustment strategy to achieve high-fidelity HOI video generation across different objects.
Rectified Diffusion Guidance for Conditional Generation: ReCFG theoretically reveals that the sum-to-one constraint of the two coefficients in standard Classifier-Free Guidance (CFG) leads to an expectation shift in the generated distribution. By relaxing the coefficient constraint and deriving a closed-form solution for $\gamma_0$, it provides a training-free post-processing scheme with virtually no extra inference overhead to rectify the guidance effect of CFG.
Redefining in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation: CreTok redefines "creative" as a learnable, general token <CreTok>. By continuously and space-iteratively optimizing the semantics of this token in the text embedding space, it endows diffusion models with "meta-creativity" for compositional creative generation. This enables the zero-shot generation of diverse concept-blended images without additional training, achieving a generation speed 10-30 times faster than current state-of-the-art (SOTA) methods.
ReNeg: Learning Negative Embedding with Reward Guidance: ReNeg proposes directly learning negative embeddings in a continuous text embedding space guided by a reward model, replacing handcrafted negative prompts. By optimizing a minimal number of parameters, it matches the generation quality of full-model fine-tuning methods on the HPSv2 benchmark. Furthermore, the learned embeddings can be directly transferred to other T2I and T2V models.
Reversing Flow for Image Restoration: ResFlow proposes modeling the image degradation process as a deterministic continuous normalizing flow (rather than a stochastic diffusion process). By resolving the irreversibility of degradation using auxiliary variables, it achieves reversible modeling. Implementing an entropy-preserving schedule, it completes high-quality image restoration in only 4 sampling steps, achieving SOTA on tasks such as desnowing, deraining, dehazing, denoising, and deblocking.
Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward: This paper proposes LaSRO, which learns a differentiable surrogate reward model in the latent space to transform any (including non-differentiable) reward signal into differentiable gradient guidance. This achieves efficient reward fine-tuning for two-step diffusion models, significantly outperforming mainstream reinforcement learning methods such as DDPO and DPO.
RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing: RoomPainter is proposed, which adapts 2D diffusion models into a 3D-consistent indoor scene texture synthesis tool via zero-shot Multi-View Integrated Sampling (MVIS) and Correlated View Attention, employing a two-stage strategy to ensure global and local consistency.
RORem: Training a Robust Object Remover with Human-in-the-Loop: RORem proposes a "Human-in-the-Loop" semi-supervised data generation paradigm. It first generates removal results using an initial model, leverages human annotators to filter high-quality samples, and then trains a discriminator to automate subsequent filtering. This iteratively constructs a 200K+ high-quality paired object removal dataset, leading to a fine-tuned SDXL model that outperforms prior methods by 18%+ in success rate. After distillation, the model requires only 4 steps (<1 second) for generation.
SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing: This work proposes SALAD, a skeleton-aware latent diffusion model that explicitly models fine-grained interactions among joints, frames, and text using a skeleton-temporal structured VAE and denoiser, and achieves zero-shot text-driven motion editing via cross-attention maps.
SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer: SaMam is proposed as the first arbitrary image style transfer framework based on the Mamba state space model. By predicting SSM weight parameters from style embeddings via a style-aware S7 block, and combining this with zigzag scanning and local enhancement mechanisms, it achieves the optimal balance between stylization quality and efficiency.
Scaling Down Text Encoders of Text-to-Image Diffusion Models: This paper distills the T5-XXL (11B) text encoder into T5-Base (220M) using a vision-based knowledge distillation method. While reducing the size by 50x, it incurs almost no loss in image quality and semantic understanding, revealing that text encoders in text-to-image tasks exhibit severe over-parameterization and a "downward scaling law."
Science-T2I: Addressing Scientific Illusions in Image Synthesis: Science-T2I constructs a benchmark of 20k+ adversarial image pairs covering 16 scientific domains, revealing systematic deficiencies of current image generation models in implicit scientific reasoning (all models score below 50/100). It proposes the SciScore reward model and a two-stage alignment framework (SFT+OFT), improving the scientific reasoning capability of FLUX.1[dev] by over 50%.
ScribbleLight: Single Image Indoor Relighting with Scribbles: ScribbleLight proposes a scribble-guided generative model for single-image indoor relighting. It preserves the original texture and color using Albedo-conditioned Stable Image Diffusion, and introduces an encoder-decoder ControlNet architecture to achieve geometry-preserving, fine-grained illumination control. Users can easily perform actions such as turning lights on/off and casting shadows using simple scribbles.
SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer: A plug-and-play Semantic Continuous-Sparse Attention (SCSA) module is proposed to achieve semantic style transfer. It ensures style consistency within the same semantic region via Semantic Continuous Attention (SCA) and preserves original texture details via Semantic Sparse Attention (SSA). SCSA can be integrated into any attention-based style transfer method without requiring additional training.
See Further When Clear: Curriculum Consistency Model: This paper proposes the Curriculum Consistency Model (CCM). It reveals that the training difficulty (knowledge discrepancy) across different timesteps during consistency distillation is highly imbalanced. By dynamically adjusting the iteration steps of the teacher model using a PSNR-based KDC metric to maintain a consistent curriculum difficulty, CCM achieves a single-step FID of 1.64 on CIFAR-10 and successfully scales to SDXL and SD3.
Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects: This paper proposes Self-Cross Diffusion Guidance, which effectively addresses the subject mixing problem when generating similar subjects with diffusion models by penalizing the overlap between the aggregated self-attention map of one subject and the cross-attention map of another. It represents the first training-free method to simultaneously leverage the interactions between self-attention and cross-attention.
Self-Supervised ControlNet with Spatio-Temporal Mamba for Real-World Video Super-Resolution: The SCST framework is proposed, which utilizes Spatio-Temporal Continuous Mamba (STCM) for global 3D attention modeling, combines it with a MoCo-based self-supervised ControlNet to extract degradation-agnostic features, and incorporates a three-stage hybrid training strategy to achieve SOTA perceptual quality on real-world video super-resolution benchmarks.
SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion: SemanticDraw proposes a sub-second (0.64 seconds) regional multi-prompt text-to-image generation framework. It resolves the compatibility issues between regional control and diffusion acceleration methods through three stabilization strategies, and achieves near real-time interactive generation on a single RTX 2080 Ti using a multi-prompt streaming batching pipeline.
SGMatch: Semantic-Guided Non-Rigid Shape Matching with Flow Regularization: SGMatch proposes a semantic-guided non-rigid 3D shape matching framework that integrates semantic features from a vision foundation model into geometric descriptors via a Semantic-Guided Local Cross-Attention (SGLCA) module to eliminate symmetry ambiguity, and introduces a Conditional Flow Matching (CFM) regularization to promote spatial smoothness of correspondences, achieving consistent improvements under non-isometric deformations and topological noise (outperforming the previous SOTA by 24% on SMAL).
ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts: This work proposes ShapeWords, which encodes 3D shapes into special tokens that can be embedded in text prompts (via the Shape2CLIP module). This enables viewpoint-agnostic 3D shape-guided text-to-image generation, significantly outperforming ControlNet depth-map conditioning in compositional settings.
Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model: This work introduces diffusion models to the virtual try-on task of ornaments (bracelets, rings, earrings, necklaces) for the first time. It proposes an iterative pose-aware wearing-mask prediction and a mask-guided attention mechanism, achieving high-fidelity geometric structure preservation under challenging pose and scale variations.
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions: This paper proposes ShowHowTo, a video diffusion model capable of generating a sequence of step-by-step visual instructions consistent with a user-provided initial scene image and textual instructions. It also constructs a large-scale instructional dataset containing 578k sequences, collected from online instructional videos via a fully automated pipeline.
SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model: SIR-Diff is a multi-view diffusion model that achieves cross-view consistent image restoration by jointly denoising multiple degraded images of the same scene. It integrates multi-view complementary information using a Spatial-3D ResNet and a 3D Self-Attention Transformer, outperforming single-view and video restoration methods in deblurring and super-resolution tasks.
Six-CD: Benchmarking Concept Removals for Text-to-Image Diffusion Models: This work proposes the Six-CD benchmark, containing six categories of undesired concepts (harmful, nudity, celebrity, copyrighted character, object, and artistic style) and a new evaluation metric, the in-prompt CLIP score, to systematically evaluate and compare concept removal methods in text-to-image diffusion models for the first time.
SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-Image Diffusion Models: SleeperMark proposes a robust watermarking framework for T2I diffusion models. By explicitly decoupling watermarking information from the model's semantic knowledge, the watermark remains reliably detectable even after downstream fine-tuning (such as LoRA, DreamBooth, and ControlNet), maintaining a TPR@10⁻⁶FPR of over 0.93 under various fine-tuning attacks.
SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device: SnapGen-V proposes a comprehensive acceleration framework for mobile video diffusion models. By pruning an efficient spatial backbone, determining temporal layer design via a joint latency-memory architecture search, and utilizing specialized adversarial fine-tuning to reduce denoising steps to 4, the 0.6B-parameter model generates a 5-second video in under 5 seconds on an iPhone 16. This represents the first work to achieve real-time text-to-video generation on mobile devices.
SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer: SoftVQ-VAE achieves a fully differentiable continuous image tokenizer by replacing the hard categorical posterior of VQ-VAE with a soft categorical posterior (where each latent token adaptively aggregates multiple codewords). It compresses 256×256 and 512×512 images to extremely high compression ratios using only 32-64 1D tokens, enabling SiT-XL to achieve a 1.78 FID on ImageNet with an 18-55x increase in inference throughput.
STORM: Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis: STORM proposes a Spatial Transport Optimization (STO) method based on optimal transport theory, which dynamically adjusts the spatial positions of object attention maps during the denoising process of diffusion models. Without requiring any training, it achieves precise spatial layout control, effectively solving the overlooked key issue of "mislocated objects" in T2I models.
Stable Flow: Vital Layers for Training-Free Image Editing: Stable Flow proposes to automatically detect "vital layers" in DiT (FLUX) and inject attention features of the reference image only into these layers to achieve various training-free image editing operations, while introducing a latent nudging technique to improve the quality of flow model inversion for real images.
StableAnimator: High-Quality Identity-Preserving Human Image Animation: StableAnimator proposes the first end-to-end identity-preserving video diffusion framework. It maintains identity consistency during training via a global content-aware Face Encoder and a distribution-aware ID Adapter, and optimizes facial quality during inference using the Hamilton-Jacobi-Bellman (HJB) equation, enabling the generation of high-fidelity human animation videos without any post-processing tools.
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget: MicroDiT proposes a deferred masking strategy (pre-processing all patches with a lightweight patch-mixer before masking $75\%$ of them), along with layer-wise width scaling, Mixture-of-Experts (MoE), and synthetic data. With a cost of only $1,890, it trains a 1.16B-parameter sparse Transformer from scratch in 2.6 days and achieves a 12.7 FID on COCO, at only 1/118 of the training cost of Stable Diffusion.
StyleMaster: Stylize Your Video with Artistic Generation and Translation: StyleMaster achieves high-quality video stylization generation and transfer with style consistency and content preservation by combining local texture selection based on prompt-patch similarity, contrastive global style extraction derived from model illusion generation, a motion adapter, and a grayscale Tile ControlNet.
StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements: StyleStudio proposes three complementary strategies—cross-modal AdaIN, style-based classifier-free guidance (SCFG), and a teacher model—to address style overfitting, text misalignment, and layout instability in text-driven style transfer, achieving selective control of style elements.
SVFR: A Unified Framework for Generalized Video Face Restoration: This paper proposes SVFR, a unified video face restoration framework based on Stable Video Diffusion, which jointly trains three tasks—blind face restoration (BFR), colorization, and inpainting—within a single model. Through designs such as task embedding, unified latent space regularization, and facial prior learning, it achieves SOTA results across multiple video face restoration tasks.
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion: This paper proposes SwiftEdit, the first text-guided image editing tool based on a one-step diffusion model. By leveraging a one-step inversion network trained in two stages and an attention-rescaling mask editing technique, it achieves image editing within 0.23 seconds, which is at least 50 times faster than multi-step methods.
Symbolic Representation for Any-to-Any Generative Tasks: This paper proposes a symbolic generative task description language (A-Language) and a training-free inference engine. It maps natural language instructions into executable symbolic flows composed of function-parameter-topology triplets, achieving unified processing across 12 categories of multimodal generation tasks and matching or exceeding end-to-end trained unified multimodal models in quality and flexibility.
SyncSDE: A Probabilistic Framework for Diffusion Synchronization: SyncSDE proposes a probabilistic theoretical framework to analyze and improve diffusion synchronization. By decomposing the synchronization process into a "prior score function" and "inter-trajectory correlation modeling," it reveals that heuristic strategies should focus on correlation modeling. This enables the formulation of optimal synchronization strategies across tasks using a single hyperparameter $\lambda$, outperforming SyncTweedies in various tasks such as mask-based T2I, wide image generation, image editing, visual anagrams, and 3D texturing.
SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction: SyncVP is a multi-modal video prediction framework that synchronously generates future RGB and depth frames using a dual-branch diffusion model coupled with efficient spatio-temporal cross-modal attention. Using shared-noise and cross-modality guidance training strategies, it achieves SOTA performance on Cityscapes while supporting partial-modality inputs.
Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework: This work proposes a three-stage AC-DC denoiser (Auto-Correction + Directional Correction + Score-Based Denoising) to address the manifold mismatch issue when embedding score-based diffusion priors into the ADMM-PnP framework. It establishes the first theoretical convergence guarantee for score-based denoisers in ADMM and consistently outperforms existing baselines in inverse problems such as denoising, inpainting, deblurring, super-resolution, phase retrieval, and HDR.
Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting: The TMTB framework is proposed. By enhancing background diversity via diffusion model inpainting, introducing a VMamba backbone, and leveraging an anti-noise classification branch, it reduces the MAE on JHU-Crowd++ to 67.0 using only 5% of labeled data, substantially surpassing the state-of-the-art in semi-supervised crowd counting.
TCFG: Tangential Damping Classifier-Free Guidance: From the perspective of data manifold geometry, this work utilizes SVD to remove the unaligned tangential component in the unconditional score relative to the conditional score, improving CFG sampling quality with minimal computational overhead. It consistently reduces FID across SD1.5, SDXL, SD3, and DiT.
Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts: This work identifies a three-stage (Profiling-Mutation-Refinement) process in diffusion generation and the "score trap" mechanism responsible for artifact formation. It proposes ASCED, which monitors anomalous score dynamics to detect and correct artifacts in real-time without training, matching or exceeding supervised methods.
The Art of Deception: Color Visual Illusions and Diffusion Models: This paper discovers that the intermediate representations of diffusion models (particularly during the DDIM inversion process) naturally produce luminance/color shifts consistent with human perception. Based on this, a method for generating novel visual illusions using text-to-image diffusion models is proposed, with psychophysical experiments demonstrating that the generated illusions successfully deceive humans.
Tiled Diffusion: Tiled Diffusion is proposed to enable seamless and coherent tileable image generation across various tiling scenarios ranging from self-tiling to complex many-to-many connections, by introducing tiling and similarity constraints directly in the latent space of diffusion models.
TinyFusion: Diffusion Transformers Learned Shallow: This paper proposes TinyFusion, a learnable depth pruning method. By utilizing Gumbel-Softmax differentiable sampling for layer masking and co-optimizing weight updates to simulate fine-tuning, it explicitly optimizes the recoverability of the pruned model (rather than minimizing post-pruning loss). On DiT-XL, it constructs shallow Diffusion Transformers at less than 7% of the pre-training cost, achieving a 2× speedup with an FID of only 2.86.
TKG-DM: Training-Free Chroma Key Content Generation Diffusion Model: This paper proposes TKG-DM, which controls the background color of generated images by manipulating the channel mean of the initial noise in diffusion models, and combines this with a Gaussian mask to separate the foreground from the chroma key background, generating high-quality green screen/chroma key images without any fine-tuning.
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation: This work proposes TokenFlow, a unified image tokenizer that decouples semantic and pixel-level feature learning through a dual-codebook, shared-mapping architecture. For the first time, discrete visual inputs outperform LLaVA-1.5 13B in understanding (+7.2%), while autoregressive generation reaches a SOTA GenEval score of 0.55.
Towards Scalable Human-Aligned Benchmark for Text-Guided Image Editing: This work proposes HATIE, a large-scale (18K images/50K queries), fully automated, multi-dimensional evaluation benchmark for text-guided image editing, which aligns with human perception by combining metrics from 5 dimensions and fitting their weights to user-study results.
Towards Transformer-Based Aligned Generation with Self-Coherence Guidance: This work proposes Self-Coherence Guidance (SCG), a training-free alignment method tailored for Transformer-structured text-guided diffusion models, which improves attribute binding, fine-grained attribute binding, and style binding by directly optimizing cross-attention maps (rather than latent variables).
Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation: This work presents the first systematic quantification of the uncertainty of text-to-image generation models with respect to prompts. It proposes the PUNC method, which uses LVLMs to caption generated images and compares the captions with the original prompts in text space, separating epistemic and aleatoric uncertainty via precision/recall.
Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?: The TrainProVe method is proposed based on generalization error bound theory. It verifies whether a suspicious model was trained on synthetic data from a specific generative model using shadow model training and hypothesis testing, achieving an accuracy of over 99%.
Traversing Distortion-Perception Tradeoff Using a Single Score-Based Generative Model: This paper proposes a variance-scaled reverse diffusion process that controls the variance of reverse sampling via a single parameter $\lambda \in [0,1]$. This allows a single pre-trained score network to flexibly traverse the optimal solution of the distortion-perception tradeoff curve, with its optimality mathematically proven under conditional Gaussian distributions.
Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation: This work proposes the FIRM framework, which trains specialized reward models (FIRM-Edit-8B / FIRM-Gen-8B) via "difference-first" (for editing) and "plan-and-score" (for generation) data construction pipelines. Combined with a "Base-and-Bonus" reward strategy (CME/QMA) to resolve reward hacking in RL, it achieves SOTA results on both image editing and T2I generation tasks.
TurboFill: Adapting Few-Step Text-to-Image Model for Fast Image Inpainting: TurboFill proposes a three-step adversarial training scheme to train an inpainting adapter (ControlNet architecture) directly on the few-step distilled diffusion model DMD2. It achieves high-quality image inpainting that outperforms multi-step BrushNet in just 4 inference steps, reducing training costs by over 10 times.
UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models: UIBDiffusion proposes the first imperceptible backdoor attack method for diffusion models. By repurposing universal adversarial perturbation (UAP) as a backdoor trigger, it achieves a triple threat of universality (image- and model-agnostic), practicality (high attack success rate without affecting generation quality), and undetectability (bypassing state-of-the-art defenses like Elijah and TERD).
UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion: UltraFusion reformulates exposure fusion as a guided inpainting problem for the first time. By leveraging under-exposed images as soft guidance rather than hard constraints for over-exposed regions, it achieves ultra-high dynamic range imaging with a 9-stop exposure difference while maintaining robustness against alignment errors and illumination variations.
Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model: This work discovers that different regions of LR images (flat areas vs. textured edge regions) correspond to different timesteps in the diffusion process, and proposes an Uncertainty-guided Noise Weighting (UNW) strategy. UNW applies less noise to flat regions to preserve crucial LR information, achieving state-of-the-art (SOTA) super-resolution performance with a smaller model size and lower training cost.
Uni-Renderer: Unifying Rendering and Inverse Rendering via Dual Stream Diffusion: Uni-Renderer proposes a unified framework based on a dual-stream diffusion model, which formulates rendering (from intrinsic attributes to RGB images) and inverse rendering (from RGB images to intrinsic attributes) as two conditional generation tasks. By utilizing cycle-consistency constraints, it mitigates the inherent ambiguity in inverse rendering, achieving outcomes superior to existing methods in material decomposition and rendering editing.
UNIC-Adapter: Unified Image-Instruction Adapter with Multi-modal Transformer for Image Generation: Based on the MM-DiT architecture, UNIC-Adapter designs a unified image-instruction adapter. Through a cross-attention mechanism and RoPE-enhanced spatial-aware injection, a single SD3 model is enabled to handle 14 conditional image generation tasks, including pixel-level control, subject-driven generation, and style transfer.
UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations: UniCom constructs a compact continuous representation space by applying channel-wise compression (rather than spatial downsampling) to continuous semantic features from VLMs. It unifies multimodal understanding and generation within a Transfusion architecture, achieving SOTA generation quality in a unified model.
Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling: This paper proposes U2Diff, a unified diffusion framework that handles both multi-agent trajectory completion and prediction. It provides state-wise uncertainty estimation through an augmented denoising loss and introduces a Rank Neural Network to rank the error probabilities of multi-modal predictions.
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics: UniReal proposes to unify various image generation and editing tasks into a "discontinuous frame generation" framework. By leveraging video data as a scalable, universal source of supervision, it achieves unified handling of multiple tasks such as instruction-based editing, customized generation, and object insertion within a single diffusion Transformer through hierarchical prompting and text-image association mechanisms.
Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing: To achieve tuning-free image editing based on Flow Transformer (MM-DiT), this paper proposes a two-stage flow inversion method (fixed-point iteration + velocity field compensation) and an Adaptive Layer Normalization (AdaLN)-based invariance control mechanism to uniformly support both rigid and non-rigid editing operations.
Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing: By embedding the powerful prior knowledge of pre-trained diffusion models into Deep Unfolding Networks (DUNs), this work proposes the DMP-DUN method, enabling high-quality image compressive sensing reconstruction in only 2 steps.
V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration: V-Bridge redefines image restoration as a progressive video generation process. It leverages pre-trained video generative priors (Wan2.2-TI2V-5B) to achieve competitive multi-task image restoration using only 1,000 multi-task training samples, less than 2% of the data used by existing methods.
VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness: This paper proposes VerbDiff, a text-to-image diffusion model that generates accurate human-object interaction images without requiring extra conditions (such as bounding boxes). It eliminates interaction verb bias using Relation Decoupled Guidance (RDG) and extracts local interaction regions from cross-attention maps via an Interaction Region Module (IR Module) for directional guidance.
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos: VideoWorld explores whether purely visual video generation models can learn complex knowledge (rules, reasoning, planning) from unlabeled videos. It proposes a Latent Dynamics Model (LDM) to compress multi-step visual changes, achieving a professional 5-dan level in Go with only 300 million parameters.
Visual-ERM: Reward Modeling for Visual Equivalence: This paper proposes Visual-ERM, a multimodal generative reward model that directly evaluates the rendering quality of vision-to-code tasks in visual space, providing fine-grained, interpretable, and task-agnostic reward signals for RL training and test-time scaling.
Visual Lexicon: Rich Image Features in Language Space: ViLex proposes a visual encoder that encodes images into the text vocabulary space. Through self-supervised training using a frozen text-to-image (T2I) diffusion model, the generated image tokens capture both high-level semantics and fine-grained visual details, outperforming conventional methods in both image reconstruction and visual understanding tasks.
Visual Persona: Foundation Model for Full-Body Human Customization: This paper proposes Visual Persona, the first foundation model for full-body human customization. Through large-scale paired dataset curation (580K images / 100K identities) and a body-part partitioned Transformer decoder architecture, it achieves high-fidelity full-body appearance preservation and diverse text-guided generation.
ViUniT: Visual Unit Tests for More Robust Visual Programming: ViUniT proposes a framework for automatically generating visual unit tests. By utilizing an LLM to generate image descriptions and expected answers, and a text-to-image model to generate test images, the framework verifies the logical correctness of visual programs. This elevates 7B open-source models to surpass gpt-4o-mini and reduces "right-for-the-wrong-reason" programs by 40%.
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary: This paper proposes VLog, which defines video narrations as vocabulary units. Its generative retrieval architecture (GPT-2 reasoning + SigLIP retrieval) enables efficient video understanding that is 10-20 times faster than generative VideoLLMs.
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis: VLOGGER is the first approach to generate full-body talking portrait videos, including facial expressions and upper-body gestures, from a single portrait image and an audio input. Through a two-stage diffusion model pipeline (audio $\to$ 3D motion $\to$ video), it achieves high-quality, variable-length human video synthesis, outperforming existing methods on three public benchmarks.
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat: This paper proposes WeGen, a unified framework that integrates multimodal understanding and visual generation into a single model. By leveraging a Dynamic Instance Identity Consistency (DIIC) data pipeline and a Prompt Self-Rewriting (PSR) mechanism, it addresses the twin challenges of preserving reference image consistency and maintaining generation diversity, achieving an interactive experience akin to a conversational design assistant.
Where's the Liability in the Generative Era? Recovery-Based Black-Box Detection of AI-Generated Content: This paper proposes a black-box AI-generated image detection method based on a "corrupt-and-recover" strategy. The core hypothesis is that a generative model can more easily recover the masked parts of its own generated images. By fine-tuning a surrogate model with distribution alignment, detection accuracy on unknown target models is further enhanced, requiring fewer than 1,000 API samples and 2 hours of GPU time.
Yo'Chameleon: Personalized Vision and Language Generation: Yo'Chameleon is the first work to explore the personalization of Large Multimodal Models (LMMs). Through a dual soft prompt + self-prompting mechanism, along with a "soft-positive" training strategy, it achieves personalized text understanding and image generation using only 3-5 images and 32 learnable tokens.
Z-Magic: Zero-shot Multiple Attributes Guided Image Creator: This work proposes the Z-Magic framework, which reformulates attribute dependencies in multi-attribute image generation from a conditional probability perspective. By introducing conditional-dependent gradient guidance and multi-task learning optimization, it achieves coherent multi-attribute generation under a zero-shot setting.
Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond): CM4IR proposes a zero-shot image restoration scheme based on Consistency Models (CM). It combines a novel noise injection mechanism (decoupled denoising/injected noise levels + randomized/estimated noise splitting) with back-projection guidance and improved initialization. Using only 4 neural network evaluations (NFEs), it surpasses existing diffusion model methods that require 20–1000 steps.
Emuru: Zero-Shot Styled Text Image Generation, but Make It Autoregressive: This paper proposes Emuru, the first autoregressive model for handwritten text image generation (HTG), combining a specialized VAE and a T5 Transformer encoder-decoder. Trained solely on synthetic data with 100k+ fonts, it generalizes zero-shot to unseen handwritten styles and supports arbitrary-length text generation.
ZoomLDM: Latent Diffusion Model for Multi-Scale Image Generation: ZoomLDM proposes a scale-conditioned latent diffusion model that constructs a cross-magnification latent space through a trainable Summarizer module, achieving high-quality pathology image generation across multiple scales. It represents the first work to support globally consistent large image synthesis up to $4096 \times 4096$ pixels as well as training-free super-resolution.