🎨 Image Generation¶
📹 ICCV2025 · 219 paper notes
- A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
-
This paper proposes A0, an affordance-aware hierarchical diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding (predicting object-centric contact points and post-contact trajectories via an Embodiment-Agnostic Affordance Representation) and low-level action execution. Pretrained on 1 million contact-point annotations and fine-tuned with minimal task-specific data, A0 deploys across four robot platforms (Franka, Kinova, Realman, and Dobot) and reaches a 45% success rate on complex trajectory tasks such as whiteboard wiping.
- A Unified Framework for Motion Reasoning and Generation in Human Interaction
-
This paper proposes MoLaM, a unified interactive motion-language model that, through a three-stage training strategy and a newly constructed Inter-MT² dataset (82.7K multi-turn instructions), is the first to simultaneously achieve understanding, generation, editing, and reasoning of dyadic interaction motion within a single framework.
- Accelerating Diffusion Sampling via Exploiting Local Transition Coherence
-
This paper proposes LTC-Accel, a training-free diffusion sampling acceleration method based on the phenomenon of Local Transition Coherence (LTC). By exploiting the strong correlation between transition operators of adjacent denoising steps, the method approximates the current step's computation using the previous step's transition operator. It achieves 1.67× speedup on Stable Diffusion v2, and combined with distilled models, reaches 10× acceleration in video generation.
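As a rough illustration of the idea, here is a minimal sketch of transition-operator reuse in an Euler-style sampler; `model`, `sigmas`, and the fixed reuse schedule are illustrative placeholders, not the paper's actual implementation:

```python
import torch

@torch.no_grad()
def ltc_accel_sample(model, x, sigmas, reuse_every=2):
    """Skip network calls by reusing the previous step's transition.

    Local Transition Coherence: the map x_t -> x_{t-1} changes slowly across
    adjacent steps, so every `reuse_every`-th step we replay the cached
    transition instead of evaluating the network (schematic).
    """
    prev_in = prev_out = None
    for i in range(len(sigmas) - 1):
        if i % reuse_every == 1 and prev_in is not None:
            x = x + (prev_out - prev_in)      # replay the cached transition
        else:
            eps = model(x, sigmas[i])         # one full network evaluation
            x_next = x + (sigmas[i + 1] - sigmas[i]) * eps  # Euler update
            prev_in, prev_out = x, x_next
            x = x_next
    return x
```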
- Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Models and Small Edge Models
-
This paper proposes RouteT2I, the first edge-cloud model routing framework for text-to-image generation. It dynamically routes each request to either a lightweight edge model or a large cloud model via multi-dimensional quality metrics, Pareto Relative Superiority comparison, and a dual-gated token-selection MoE routing model, achieving 83.97% of the quality gain of exclusively using the cloud model at a 50% routing rate.
- Addressing Text Embedding Leakage in Diffusion-Based Image Editing
-
This work identifies the root cause of attribute leakage in text-guided diffusion-based image editing, namely semantic entanglement in the EOS embeddings of autoregressive text encoders, and proposes the ALE framework with three components: Object-Restricted Embedding (ORE) to disentangle the embeddings, Region-Guided Blended Cross-Attention Masking (RGB-CAM) to constrain spatial attention, and Background Blending (BB) to preserve unedited regions. A dedicated evaluation benchmark, ALE-Bench, is also introduced.
- ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
-
This paper proposes ADIEE, an automated pipeline for constructing training datasets for instruction-guided image editing evaluation. A LLaVA-NeXT-8B model is fine-tuned on over 100K samples as a scorer, surpassing all open-source VLMs and Gemini-Pro 1.5 on multiple benchmarks. The trained scorer can further serve as a reward model to improve image editing models.
- Aether: Geometric-Aware Unified World Modeling
-
This paper proposes Aether, a geometry-aware unified world model obtained by post-training the CogVideoX video diffusion model on synthetic 4D (RGB-D) data. Through a multi-task strategy that randomly combines input and output modalities, Aether jointly performs 4D reconstruction, action-conditioned video prediction, and goal-conditioned visual planning, and transfers zero-shot to real-world scenes with performance comparable to domain-specific models.
- AIComposer: Any Style and Content Image Composition via Feature Integration
-
AIComposer proposes the first cross-domain image composition method that requires no text prompts. By fusing foreground and background CLIP features via an MLP network, combined with backward inversion + forward denoising and a local cross-attention strategy, the method achieves natural stylization and seamless composition without training the diffusion model, improving LPIPS and CSD metrics by 30.5% and 18.1%, respectively.
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
-
This paper proposes AID, a framework that transfers a pretrained Image2Video diffusion model (SVD) to text-guided video prediction (TVP) tasks. Through MLLM-assisted video state prediction, a Dual-Query Transformer for condition injection, and spatiotemporal adapters, AID surpasses the previous state-of-the-art FVD scores by over 50% across multiple datasets.
- Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing
-
This paper proposes ISLock, the first training-free image editing framework for autoregressive (AR) visual generation models. Through Anchor Token Matching (ATM), it implicitly aligns self-attention patterns in the latent space, enabling structure-consistent text-guided image editing.
- AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
-
This paper proposes AnimeGamer, an infinite anime life simulation system built upon a multimodal large language model (MLLM). By predicting the next game state via action-aware multimodal representations — comprising dynamic animation shots and character state updates — the system achieves a continuously consistent interactive anime gaming experience.
- Anti-Tamper Protection for Unauthorized Individual Image Generation
-
This paper proposes Anti-Tamper Perturbation (ATP), which decouples protection perturbations (preventing forged generation) and authorization perturbations (detecting purified tampering) into separate frequency-domain regions. When an attacker attempts to purify the protective signal, the anti-tamper mechanism is triggered to deny service, achieving a 100% protection success rate against various purification attacks.
- AnyPortal: Zero-Shot Consistent Video Background Replacement
-
AnyPortal presents a zero-shot, training-free video background replacement framework that synergistically leverages IC-Light's relighting capability and the temporal prior of a video diffusion model (CogVideoX), together with a newly proposed Refinement Projection Algorithm (RPA) for pixel-level foreground preservation, running efficiently on a single 24 GB GPU.
- Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!
-
This paper exposes the threat of "neural plagiarism"—diffusion models can readily replicate copyright-protected images (including watermarked ones). It proposes a universal attack framework based on "anchors and shims," searching for perturbations in the cross-attention mechanism to achieve coarse-to-fine semantic modification, bypassing copyright protections ranging from visible trademarks to invisible watermarks.
- AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
-
This paper proposes APT (AutoPrompT), a black-box red-teaming framework driven by an LLM. Through an alternating optimize-finetune pipeline and a dual-evasion strategy, APT automatically generates human-readable adversarial suffixes that bypass content filters, effectively circumventing the safety mechanisms of T2I models while enabling zero-shot cross-prompt transferability.
- Balanced Image Stylization with Style Matching Score
-
This paper proposes Style Matching Score (SMS), which recasts image stylization as a style distribution matching problem. Through progressive spectrum regularization and semantic-aware gradient refinement, SMS achieves a superior balance between style alignment and content preservation, and can be distilled into a lightweight feed-forward network for one-step stylization.
- Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
-
This paper proposes BCD (Bitrate-Controlled Diffusion), a general self-supervised video disentanglement framework that separates per-frame motion features from global content features via a low-bitrate vector quantization information bottleneck, and reconstructs video using a conditional diffusion model. The approach demonstrates high-quality motion transfer and autoregressive video generation on talking-head and pixel-art cartoon datasets.
- 3DSR: Bridging Diffusion Models and 3D Representations for 3D Consistent Super-Resolution
-
This paper proposes 3DSR, an alternating iterative framework that couples diffusion-based super-resolution (SR) with 3D Gaussian Splatting (3DGS) for 3D-consistent super-resolution: after each denoising step, the current SR images are used to train a 3DGS, whose 3D-consistent renderings are re-encoded into the latent space to guide the next denoising step. Without fine-tuning any model, it explicitly enforces cross-view consistency, achieving +1.16 dB PSNR and a 50% FID reduction on LLFF compared with StableSR.
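A schematic of the alternating loop; all names (`sr_diffusion`, `gaussians`, the encode/decode calls) are hypothetical placeholders rather than the authors' API:

```python
def sample_3d_consistent_sr(sr_diffusion, views, cameras, gaussians, timesteps):
    """Alternate between diffusion SR denoising and 3DGS fitting (sketch)."""
    latents = [sr_diffusion.init_latent(v) for v in views]
    for t in timesteps:
        latents = [sr_diffusion.denoise_step(z, t, cond=v)
                   for z, v in zip(latents, views)]
        sr_images = [sr_diffusion.decode(z) for z in latents]  # current SR estimates
        gaussians.fit(sr_images, cameras)         # train 3DGS on the SR images
        renders = gaussians.render(cameras)       # 3D-consistent re-renderings
        latents = [sr_diffusion.encode(r) for r in renders]    # guide the next step
    return sr_images
```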
- Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
-
This paper proposes TDSM (Triplet Diffusion for Skeleton-Text Matching), the first work to apply diffusion models to zero-shot skeleton-based action recognition (ZSAR). TDSM achieves implicit alignment between skeleton features and text prompts through the reverse diffusion process, and introduces a triplet diffusion loss to enhance discriminability. It substantially outperforms state-of-the-art methods on NTU-60/120 and PKU-MMD, with improvements ranging from 2.36% to 13.05%.
- BVINet: Unlocking Blind Video Inpainting with Zero Annotations
-
This paper is the first to formally define and address the task of blind video inpainting—simultaneously predicting where to restore and how to restore, end-to-end, without any annotation of corrupted regions. A mask prediction network and a video completion network mutually reinforce each other via a consistency constraint, achieving strong results on both synthetic data and real-world applications (danmaku removal and scratch repair).
- CaO2: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation
-
This paper identifies two critical issues in diffusion-based dataset distillation — objective inconsistency and condition inconsistency — and proposes a two-stage framework, CaO2: the first stage mitigates objective inconsistency via classifier-guided sample selection, and the second stage mitigates condition inconsistency via latent space optimization to maximize conditional likelihood, achieving an average improvement of 2.3% on ImageNet.
- CAP: Evaluation of Persuasive and Creative Image Generation
-
This paper proposes three novel evaluation metrics (creativity, alignment, and persuasiveness) for the task of advertising image generation, and leverages LLMs to expand implicit messages into explicit visual descriptions to improve T2I model performance on advertisement generation, achieving significantly higher agreement with human annotations than baselines such as CLIPScore.
- CharaConsist: Fine-Grained Consistent Character Generation
-
This paper proposes a training-free, fine-grained consistent character generation method that achieves high-quality cross-image character consistency on a DiT architecture (FLUX.1) for the first time, via Point-Tracking Attention, adaptive token merging, and foreground-background decoupled control.
- CHORDS: Diffusion Sampling Accelerator with Multi-Core Hierarchical ODE Solvers
-
This paper proposes CHORDS, a training-free and model-agnostic diffusion sampling acceleration framework based on multi-core hierarchical ODE solvers. Using a slow-to-fast solver hierarchy with an inter-core rectification mechanism, CHORDS achieves 2.1×–2.9× speedups on 4–8 GPUs without sacrificing generation quality.
- CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts
-
This paper proposes CNS-Bench, the first benchmark that leverages LoRA adapters to impose continuous and photorealistic nuisance shifts on diffusion models for systematically evaluating the OOD robustness of image classifiers, covering 14 shift types, 5 severity levels, and 40+ classifiers.
- CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
-
CoMPaSS leverages the SCOP data engine to curate spatially unambiguous training data and introduces the parameter-free TENOR module to inject token ordering information into the attention mechanism, substantially improving spatial relationship generation accuracy in T2I diffusion models (VISOR +98%, GenEval Position +131%).
- CompleteMe: Reference-based Human Image Completion
-
This paper proposes the CompleteMe framework, which leverages a dual U-Net architecture and Region-focused Attention (RFA) Block to achieve high-fidelity reference-guided human image completion by exploiting fine-grained person-specific details (clothing textures, tattoos, etc.) from reference images.
- Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal
-
This paper proposes CODiff, a compression-aware one-step diffusion model for JPEG artifact removal. The core contribution is a Compression-aware Visual Embedder (CaVE) that extracts JPEG compression priors via an explicit–implicit dual learning strategy, guiding the diffusion model toward high-quality restoration. CODiff comprehensively outperforms existing methods on LIVE-1, Urban100, and DIV2K-Val while achieving extremely high inference efficiency.
- CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation
-
This paper proposes CompSlider, a compositional slider model that generates conditional priors to enable simultaneous, independent, and fine-grained control over multiple attributes in T2I foundation models. It addresses inter-attribute entanglement via a disentanglement loss and a structural consistency loss.
- Contrastive Flow Matching (ΔFM)
-
A contrastive regularization term is introduced into the Flow Matching training objective to enforce separation between velocity fields of different conditions, achieving 9× training acceleration, 5× fewer sampling steps, and up to 8.9 FID reduction with zero additional inference overhead.
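A minimal sketch of such an objective, assuming a linear-interpolation flow path and in-batch shuffling for negatives (the paper's exact weighting and pairing may differ; `v_theta`'s signature is illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_flow_matching_loss(v_theta, x0, x1, cond, lam=0.05):
    """Flow matching plus a contrastive push-away term (schematic ΔFM).

    The standard term pulls the predicted velocity toward the true (x1 - x0);
    the contrastive term pushes it away from the target velocity of a
    negative pair drawn from other conditions in the batch.
    """
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    xt = (1 - t) * x0 + t * x1                     # linear interpolation path
    v_pred = v_theta(xt, t.flatten(), cond)        # assumed model signature
    v_target = x1 - x0
    v_neg = torch.roll(v_target, shifts=1, dims=0) # another sample's target
    return F.mse_loss(v_pred, v_target) - lam * F.mse_loss(v_pred, v_neg)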
- CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models
-
This work is the first to explore content-style decomposition (CSD) in visual autoregressive (VAR) models. Through three key innovations—scale-aware alternating optimization, SVD-based style embedding rectification, and augmented key-value memory—CSD-VAR achieves content preservation and style transfer quality that surpasses existing diffusion-model-based methods.
- CURE: Cultural Gaps in the Long Tail of Text-to-Image Systems
-
This work introduces the CURE benchmark and scoring suite, which employs Marginal Information Attribution (MIA) of attribute specifications as a proxy for human judgment to systematically evaluate the representational capacity of T2I systems across the global cultural long tail.
- Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
-
This paper proposes CycleReward, which uses cycle consistency (reconstruction similarity via image→text→image or text→image→text) as a self-supervised signal in place of human preference annotations, yielding the 866K preference-pair dataset CyclePrefDB. The trained reward model surpasses HPSv2, PickScore, and ImageReward on detailed caption evaluation, and DPO training with it improves both VLMs and diffusion models, all without any human annotation.
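The image-to-text direction of the cycle reduces to a three-call pipeline, sketched below with `captioner`, `t2i_model`, and `sim` as stand-ins for the models used in the paper:

```python
def cycle_reward(image, captioner, t2i_model, sim):
    """Image -> text -> image cycle-consistency score (schematic)."""
    caption = captioner(image)        # describe the image with a VLM
    recon = t2i_model(caption)        # reconstruct an image from that caption
    return sim(image, recon)          # perceptual similarity as the reward
```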
- DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
-
This paper proposes DC-AR, a masked autoregressive text-to-image generation framework built upon a Deep Compression Hybrid Tokenizer (DC-HT, 32× spatial compression). Through a hybrid pipeline of discrete token generation for structure followed by residual token refinement, DC-AR achieves state-of-the-art gFID of 5.49 on MJHQ-30K while delivering 1.5–7.9× higher throughput than diffusion models.
- DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing
-
DCT-Shield introduces adversarial perturbations in the Discrete Cosine Transform (DCT) domain rather than pixel space, making the immunization noise highly imperceptible and inherently robust to JPEG compression, thereby effectively defending against diffusion-model-based malicious image editing.
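The core operation in sketch form: the perturbation `delta` (optimized elsewhere against the editing model) is added to DCT coefficients rather than pixels, which is what gives the noise its JPEG robustness. This is an assumed simplification of the paper's method:

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_perturb(image, delta):
    """Apply an adversarial perturbation in the DCT domain (schematic).

    image: float array (H, W, C) in [0, 1]; delta: same-shape perturbation.
    Perturbing frequency coefficients survives JPEG compression better than
    pixel-space noise, since JPEG itself quantizes in the DCT domain.
    """
    coeffs = dctn(image, axes=(0, 1), norm="ortho")   # to frequency domain
    coeffs = coeffs + delta                            # adversarial update
    out = idctn(coeffs, axes=(0, 1), norm="ortho")     # back to pixel space
    return np.clip(out, 0.0, 1.0)
```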
- Deeply Supervised Flow-Based Generative Models
-
DeepFlow introduces deep supervision and a VeRA (Velocity Refiner with Acceleration) module between Transformer layers of flow-based models, aligning intermediate-layer velocity features via second-order ODE dynamics. Without relying on any external pretrained model, it achieves an 8× training speedup and significant FID improvement.
- DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis
-
This paper proposes DeepShield, a deepfake video detection framework that combines Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). It provides patch-level supervision via spatiotemporal artifact modeling and synthesizes diverse forgery representations through distribution-level feature augmentation, significantly outperforming state-of-the-art methods in cross-dataset and cross-manipulation evaluations.
- Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation
-
Dense2MoE is the first paradigm for converting dense Diffusion Transformers (DiT) into sparse MoE structures. By replacing FFN layers with MoE layers and grouping Transformer blocks into Mixture of Blocks (MoB), combined with a multi-stage distillation pipeline, it compresses FLUX.1's 12B parameters to 5.2B activated parameters while preserving original performance, comprehensively outperforming pruning-based methods.
- Dense Policy: Bidirectional Autoregressive Learning of Actions
-
This paper proposes Dense Policy, a robot manipulation policy based on bidirectional autoregressive expansion, which achieves hierarchical coarse-to-fine action generation in logarithmic time and surpasses mainstream generative policies such as Diffusion Policy and ACT on both simulation and real-world tasks.
- Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent
-
This paper proposes DescriptiveEdit, which reframes "instruction-based image editing" as "text-to-image generation conditioned on a reference image." A Cross-Attentive UNet introduces attention bridge layers to inject reference image features into the generation process. With only 75M trainable parameters, the method achieves high-fidelity descriptive editing and is seamlessly compatible with community tools such as ControlNet and IP-Adapter.
- DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models
-
This paper proposes DDIM Inversion Attack (DIA), which disrupts the image editing capability of diffusion models by directly attacking the DDIM inversion trajectory. DIA effectively defends against malicious deepfake generation and privacy-violating content synthesis, substantially outperforming existing defenses such as AdvDM and Photoguard across diverse editing methods.
- DICE: Staleness-Centric Optimizations for Parallel Diffusion MoE Inference
-
DICE is a framework targeting the staleness problem in parallel inference of MoE-based diffusion models. Through three levels of optimization — step-level interweaved parallelism, layer-level selective synchronization, and token-level conditional communication — DICE achieves a 1.26× speedup on DiT-MoE with negligible quality degradation.
- DiffDoctor: Diagnosing Image Diffusion Models Before Treating
-
This paper proposes DiffDoctor, the first method to fine-tune diffusion models using pixel-level feedback. It first trains a robust artifact detector (1M+ samples with a category-balancing strategy), then backpropagates gradients through the detector to the diffusion model by minimizing the artifact confidence of each pixel in synthesized images, achieving significant artifact reduction on unseen prompts.
- DiffSim: Taming Diffusion Models for Evaluating Visual Similarity
-
DiffSim is the first work to discover that attention-layer features of pretrained diffusion models (Stable Diffusion) can be used to measure visual similarity. It proposes the Aligned Attention Score (AAS), which aligns features from two images in the self-attention and cross-attention layers of the U-Net and computes cosine similarity, achieving state-of-the-art performance on multiple benchmarks covering human perceptual alignment, style similarity, and instance consistency.
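A simplified sketch of the aligned-attention idea: re-attend one image's values with the other image's queries so the compared features are spatially aligned before taking cosine similarity. Shapes and the choice of layer are assumptions:

```python
import torch
import torch.nn.functional as F

def aligned_attention_score(q1, k1, v1, k2, v2):
    """Simplified AAS sketch over one attention layer.

    q1, k1, v1: (heads, tokens, dim) attention inputs extracted for image 1
    from a chosen U-Net attention layer; k2, v2: the same for image 2.
    """
    f1 = F.scaled_dot_product_attention(q1, k1, v1)   # image 1, self-attended
    f12 = F.scaled_dot_product_attention(q1, k2, v2)  # image 2, query-aligned
    return F.cosine_similarity(f1.flatten(1), f12.flatten(1), dim=-1).mean()
```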
- DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching
-
This paper proposes training an unconditional diffusion model in the spectral domain of Functional Maps, and replacing hand-crafted axiomatic regularizers (e.g., Laplacian commutativity, orthogonality) with distilled structural priors, enabling zero-shot non-rigid shape matching across categories.
- Diffusion-based 3D Hand Motion Recovery with Intuitive Physics
-
This paper proposes a physics-augmented conditional diffusion model that refines per-frame 3D hand reconstruction results into temporally consistent motion sequences via an iterative denoising process, incorporating intuitive physics constraints (kinematic and stability constraints) to substantially improve reconstruction accuracy and physical plausibility.
- DIIP: Diffusion Image Prior
-
This paper discovers that pretrained diffusion models exhibit an implicit bias analogous to Deep Image Prior (DIP) when reconstructing degraded images—the iterative optimization first produces a clean image before overfitting to the degraded input—and that this bias generalizes to a broader range of degradation types than DIP. Based on this finding, the authors propose DIIP, a fully blind (degradation-model-free) image restoration method.
- Discovering Divergent Representations between Text-to-Image Models
-
This paper proposes CompCon (Comparing Concepts), an evolutionary search algorithm that automatically discovers "divergent representations" between two text-to-image models — identifying which visual attributes differ between models and which prompt types trigger these differences — and introduces the ID² benchmark dataset for systematic evaluation.
- Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy
-
This paper proposes PaRaMS (Parameter Rearrangement & Random Multi-head Scaling), a parameter-level proactive defense method that displaces a protected model away from the shared loss basin via functionally equivalent parameter transformations, causing severe performance degradation upon merging while preserving original performance when the model is used standalone.
- DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers
-
DiTFastAttnV2 is proposed for multi-modality diffusion Transformers (MMDiT), achieving fine-grained attention compression via Head-wise Arrow Attention and Head-wise Caching mechanisms. It reduces attention FLOPs by 68% and achieves 1.5× end-to-end speedup on 2K image generation without visual quality degradation.
- DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization
-
DMQ is a framework that combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to address outlier problems in diffusion model quantization, achieving, for the first time, stable high-quality image generation under the W4A6 low-bit setting.
- Domain Generalizable Portrait Style Transfer
-
DGPST proposes a diffusion-based portrait style transfer framework. It establishes cross-domain dense semantic correspondences via a semantic adapter to warp the reference image, initializes the latent space with an AdaIN-Wavelet Transform to balance stylization and content preservation, and generates the final result through a dual-conditional diffusion model that combines ControlNet (high-frequency structural guidance) with a style adapter (style guidance). Trained solely on 30K real portrait photographs, the model generalizes to diverse domains including photos, cartoons, sketches, and anime.
- DPoser-X: Diffusion Model as Robust 3D Whole-Body Human Pose Prior
-
This paper proposes DPoser-X, a 3D whole-body human pose prior based on an unconditional diffusion model. It unifies various pose-related tasks as inverse problems and performs test-time optimization via a truncated timestep schedule for variational diffusion sampling. A hybrid training strategy is introduced to effectively combine whole-body and part-specific datasets. DPoser-X achieves up to 61% improvement across 8 benchmarks covering body, hand, face, and whole-body modeling.
- DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses
-
DreamDance proposes a human image animation framework that takes only 2D skeleton pose sequences as input. It first generates mutually aligned depth maps and normal maps from 2D poses via a Mutually Aligned Geometry Diffusion Model to enrich 3D geometric guidance, then integrates multi-level guidance signals through an SVD-based Cross-Domain Controlled Video Diffusion Model to synthesize high-quality human animations. The method achieves state-of-the-art performance on the TikTok dataset (FVD 153.07 vs. Champ 170.20).
- Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion
-
This paper proposes Dual Recursive Feedback (DRF), a training-free dual recursive feedback system that recursively refines intermediate latents via appearance feedback and generation feedback, addressing the insufficient structure/appearance disentanglement of controllable T2I diffusion models in class-invariant scenarios, thereby achieving fine-grained pose transfer and appearance fusion.
- DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
-
DynamicID achieves zero-shot single/multi-identity personalized image generation through two core components — Semantic Activation Attention (SAA) and Identity-Motion Reconfigurer (IMR) — while maintaining high fidelity and flexible facial editability.
- Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing
-
This paper proposes ELECT (Early-timestep Latent Evaluation for Candidate selecTion), a zero-shot framework that selects the optimal seed by estimating background inconsistency at early denoising timesteps, reducing computational overhead by 41% (up to 61%) while improving background consistency and instruction-following quality, without requiring external supervision or additional training.
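The selection criterion in miniature: score each seed's early-step clean-image estimate by how much it disturbs the background, then keep the best seed. Names and the exact inconsistency measure are illustrative, not the authors' API:

```python
import torch

def elect_select(candidates_x0, source_image, bg_mask):
    """Pick the seed whose early-step prediction best preserves the background.

    candidates_x0: (N, C, H, W) one-step clean-image estimates (e.g., Tweedie
    predictions) for N random seeds at an early denoising timestep;
    bg_mask: 1 where the edit should NOT change the image.
    """
    diff = (candidates_x0 - source_image.unsqueeze(0)).abs() * bg_mask
    scores = diff.flatten(1).mean(dim=1)   # background inconsistency per seed
    return scores.argmin().item()          # most background-consistent seed
```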
- EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Equivariant Flow Matching
-
EC-Flow introduces an "embodiment-centric flow" paradigm that predicts pixel-level motion trajectories of the robot body from action-unlabeled RGB videos, and converts visual predictions into executable actions via URDF kinematic constraints. It substantially outperforms object-centric methods in scenarios involving deformable objects, occlusion, and non-displacement manipulation.
- EDiT: Efficient Diffusion Transformers with Linear Compressed Attention
-
EDiT proposes a linear compressed attention mechanism that enhances query local information via ConvFusion and compresses key/value tokens via a Spatial Compressor, achieving up to 2.2× acceleration over DiT and MM-DiT architectures while maintaining comparable image quality.
- EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing
-
This paper proposes EEdit, an efficient image editing framework that achieves an average 2.46× speedup without quality degradation across diverse editing tasks—including prompt-guided, drag-based, and image composition editing—via three components: Spatial Locality Caching (SLoC) to skip computation in unedited regions, Token Index Preprocessing (TIP) for lossless acceleration of caching operations, and Inversion Step Skipping (ISS) to reduce inversion redundancy.
- Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization
-
OAT proposes an adaptive octree tokenization scheme guided by quadric error metrics (QEM) that dynamically allocates token budgets according to local geometric complexity, reducing token count by 50% while preserving reconstruction quality; on top of this tokenization, an autoregressive model, OctreeGPT, performs high-quality text-to-3D generation.
- Efficient Input-Level Backdoor Defense on Text-to-Image Synthesis via Neuron Activation Variation
-
NaviT2I identifies an "Early-step Activation Variation" phenomenon induced by backdoor triggers in text-to-image diffusion models, and proposes an efficient input-level backdoor defense framework that requires only the first diffusion iteration for analysis. The method achieves an average AUROC of 96.3% across 8 mainstream attacks while incurring only 3.8%–16.7% of the computational cost of existing methods.
- EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model
-
EmotiCrafter is proposed as the first emotional image generation method based on a continuous Valence-Arousal (V-A) model. It integrates V-A values into text features via an emotional embedding mapping network, which is then injected into Stable Diffusion XL to achieve precise dual control over content and emotion. The generated images significantly outperform existing methods in terms of emotional continuity and controllability.
- End-to-End Multi-Modal Diffusion Mamba
-
This paper proposes Multi-Modal Diffusion Mamba (MDM), an end-to-end multimodal model based on the Mamba architecture. By employing a unified VAE encoder-decoder and a multi-step selective diffusion model, MDM enables simultaneous generation of images and text with computational complexity \(\mathcal{O}(MLN^2)\), surpassing existing end-to-end models on tasks including image generation, image captioning, and VQA.
- Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
-
This paper identifies a "scoring paradox" in CLIP/BLIP-based reward models when evaluating high-quality images — detail-rich, high-fidelity images are paradoxically assigned lower scores. The authors propose two new metrics: ICT Score (Image-Contained-Text, measuring the degree to which an image encodes the textual information) and HP Score (a purely image-modal human preference score). Training on the Pick-High dataset yields over 10% improvement in preference prediction accuracy and successfully guides SD3.5-Turbo toward generating higher-quality images.
- Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts
-
This paper systematically analyzes the unintended negative effects of concept erasure techniques on non-target concepts (spillover degradation) in text-to-image models. It proposes EraseBench, a comprehensive evaluation framework covering multiple dimensions including visual similarity, binomial association, and semantic relatedness. The findings reveal that current state-of-the-art erasure methods remain unreliable in preserving the generation quality of non-target concepts.
- Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
-
This paper systematically analyzes the attention mechanism of Multimodal Diffusion Transformers (MM-DiT), decomposing the attention matrix into four functional sub-blocks (I2I/T2I/I2T/T2T). Based on this analysis, it proposes an efficient prompt-based image editing method that operates by replacing image input projections (\(\mathbf{q}_i, \mathbf{k}_i\)), and is applicable to multiple MM-DiT variants including the SD3 series and Flux.1.
- FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image
-
This paper proposes FaceCraft4D, a framework that generates animatable 360° 4D facial avatars from a single image by combining three complementary priors: a 3D shape prior (PanoHead GAN inversion), a 2D image prior (diffusion model texture enhancement), and a video prior (LivePortrait expression animation). A COIN training strategy is introduced to address multi-view data inconsistency, enabling high-quality real-time rendering at 156 FPS.
- Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention
-
This paper proposes Entanglement-Free Attention (EFA), a training-free inference-time debiasing method that injects target attributes (e.g., gender, race) into person regions by modifying the cross-attention mechanism, while preserving non-target attributes (e.g., background, objects). EFA eliminates generative bias without introducing new unfair associations.
- FedDifRC: Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning
-
This paper is the first to introduce the internal representations of a pretrained text-to-image diffusion model (Stable Diffusion) into federated learning, proposing the FedDifRC framework. Through two complementary modules—Text-Driven Diffusion Contrastive Learning (TDCL) and Noise-Driven Diffusion Consistency Regularization (NDCR)—the framework effectively mitigates data heterogeneity and achieves significant performance improvements on global models across diverse non-iid settings.
- Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
-
This paper proposes PostDiff — a training-free diffusion model acceleration framework that reduces redundancy at two levels: at the input level via a mixed-resolution denoising strategy (low resolution in early steps → high resolution in later steps), and at the module level via a hybrid caching strategy (DeepCache + cross-attention caching). The work systematically addresses the key question of whether reducing the number of denoising steps or reducing the per-step computation cost is more effective — concluding that the latter is superior across most efficiency regimes.
- FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation
-
FICGen is proposed as the first method to address the "contextual illusion dilemma" in Layout-to-Image (L2I) generation for degraded scenes (low-light, underwater, remote sensing, adverse weather, etc.). It extracts high- and low-frequency prototypes of degraded scenes via a learnable dual-query mechanism, injects them into the latent diffusion space through visual-frequency enhanced attention, and achieves foreground-background disentanglement using instance consistency maps and spatial-frequency adaptive aggregation. FICGen comprehensively outperforms existing L2I methods across five degraded-scene datasets.
- Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
-
Fix-CLIP enhances CLIP's long-text understanding capability through three key innovations: (1) a dual-branch training pipeline that aligns short texts with masked images and long texts with original images; (2) learnable Regional Prompts with unidirectional attention masks for local visual feature extraction; and (3) a hierarchical feature alignment module that aligns multi-scale features across intermediate encoder layers. After incremental training on 30M synthetic long-text data, Fix-CLIP substantially outperforms state-of-the-art methods on both long- and short-text retrieval benchmarks. Its text encoder can be directly plugged into diffusion models to improve long-text generation quality.
- FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
-
This paper proposes FLOAT, an audio-driven talking portrait generation method based on Flow Matching, which employs a Transformer architecture to predict vector fields in an orthogonal motion latent space. The approach enables efficient (~10-step sampling), temporally consistent, high-quality talking video generation, with additional support for speech-driven emotion enhancement and test-time head pose editing.
- FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems
-
FlowDPS derives a Tweedie formula for flow models to decompose the Flow ODE into a clean image estimation component and a noise estimation component. Likelihood gradients are injected into the clean image component while stochastic noise is introduced into the noise component, enabling principled posterior sampling for inverse problems under the flow matching framework. FlowDPS surpasses all prior methods on four linear inverse problems using SD3.0.
- FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
-
FlowEdit proposes an inversion-free, optimization-free, model-agnostic text-based image editing method that constructs an ODE path directly between the source and target distributions of a pre-trained flow model, achieving structure-preserving editing with lower transport cost than inversion-based approaches.
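A schematic of the inversion-free loop: rather than inverting the source to noise, integrate an ODE driven by the difference of the model's velocities under the target and source prompts. The time convention and the coupling of the two noisy states are simplifying assumptions here:

```python
import torch

@torch.no_grad()
def flowedit(v_theta, x_src, src_cond, tgt_cond, steps=28):
    """Inversion-free editing via a velocity-difference ODE (schematic).

    Assumes the rectified-flow convention x_t = (1 - t) * x_data + t * noise,
    integrated from t = 1 to t = 0. If the two prompts agree, the drift is
    zero and the source image is returned unchanged.
    """
    x_edit = x_src.clone()
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        noise = torch.randn_like(x_src)
        z_src = (1 - t) * x_src + t * noise        # noisy source state
        z_tgt = z_src + (x_edit - x_src)           # shifted onto the edit path
        dv = v_theta(z_tgt, t, tgt_cond) - v_theta(z_src, t, src_cond)
        x_edit = x_edit + (t_next - t) * dv        # Euler step on the difference
    return x_edit
```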
- FlowTok: Flowing Seamlessly Across Text and Image Tokens
-
FlowTok proposes encoding both text and images as compact 1D token representations (\(77 \times 16\)) and directly evolving between text and image tokens via flow matching, eliminating the need for complex conditioning mechanisms or noise schedules, thereby enabling efficient cross-modal generation.
- ForgeLens: Data-Efficient Forgery Focus for Generalizable Forgery Image Detection
-
This paper proposes ForgeLens, a feature-guided framework built upon a frozen CLIP-ViT backbone. Through a lightweight Weight-Shared Guidance Module (WSGM) and a Forgery-Aware Feature Integrator (FAFormer), it steers the frozen pretrained network to focus on forgery-relevant features, achieving state-of-the-art generalization performance with only 1% of training data.
- Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
-
This paper proposes Free4D, the first tuning-free framework for single-image 4D scene generation. It achieves spatial consistency via 4D geometric structure initialization and adaptive guidance denoising, temporal consistency via reference latent replacement, and integrates multi-view information into a coherent 4D Gaussian representation through modulation-based refinement, enabling real-time controllable rendering.
- FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers
-
This paper proposes FreeCus, a completely training-free subject-driven customization framework that activates the intrinsic zero-shot subject customization capability of Diffusion Transformers (DiT) through three innovations: pivotal attention sharing, an upgraded dynamic shifting mechanism for fine-grained feature extraction, and multimodal large language model (MLLM) semantic enhancement. FreeCus achieves results comparable to or better than methods that require additional training.
- FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
-
FreeMorph proposes the first tuning-free generalized image morphing method. Through two key designs—guidance-aware spherical interpolation and step-oriented change trend—it generates smooth transition sequences between image pairs of arbitrary semantics and layouts within 30 seconds, achieving a speed improvement of 10–50× over existing methods.
- FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
-
This paper proposes FreeScale, a tuning-free inference paradigm that extracts and fuses information from different receptive field scales via a Scale Fusion mechanism (global high-frequency + local low-frequency), combined with tailored self-cascade upscaling and restrained dilated convolution, achieving for the first time text-to-image generation at 8K resolution on a single A800 GPU, while also supporting high-resolution video generation.
- From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
-
This paper proposes TaylorSeer, which upgrades the feature caching paradigm for diffusion models from "cache-and-reuse" to "cache-and-forecast" — leveraging Taylor series expansion with high-order finite differences over historical features to predict intermediate features at future timesteps. TaylorSeer achieves near-lossless 4.99× acceleration on FLUX and 5.00× on HunyuanVideo, entirely without additional training.
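The forecasting step amounts to a Taylor extrapolation whose derivatives are backward finite differences over cached features; a minimal sketch (unit cache spacing and the expansion order are assumptions, not the paper's exact code):

```python
def taylor_forecast(history, dt=1.0, order=2):
    """Predict a future feature from cached ones via Taylor expansion.

    history: list of tensors [f(t-k), ..., f(t)] cached at unit spacing,
    needing at least order + 1 entries. The n-th backward difference stands
    in for the n-th derivative, so f(t + dt) ≈ Σ_n Δⁿf(t) · dtⁿ / n!.
    """
    forecast = history[-1].clone()
    diffs, factorial = list(history), 1
    for n in range(1, order + 1):
        diffs = [b - a for a, b in zip(diffs[:-1], diffs[1:])]  # Δⁿ f
        factorial *= n
        forecast = forecast + (dt ** n / factorial) * diffs[-1]
    return forecast
```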
- GameFactory: Creating New Games with Generative Interactive Videos
-
This paper proposes GameFactory, a multi-stage training strategy that decouples game style from action control on top of a pretrained video diffusion model, enabling action control learned from small-scale Minecraft data to generalize to arbitrary open-domain scenes for interactive game video generation. This is the first method with a complete technical paper that validates scene generalization over a complex action space (7 keys + mouse).
- GAP: Gaussianize Any Point Clouds with Text Guidance
-
This paper proposes GAP, a framework that leverages depth-aware image diffusion models to convert colorless point clouds into high-fidelity 3D Gaussian representations. A surface anchoring mechanism ensures geometric fidelity, and a diffusion-based inpainting strategy completes hard-to-observe regions.
- Generating Multi-Image Synthetic Data for Text-to-Image Customization
-
This paper proposes SynCD (Synthetic Customization Dataset) and its generation pipeline, which synthesizes multi-image consistent object datasets using shared attention and 3D asset priors. The trained encoder model surpasses existing encoder-based methods without requiring test-time optimization.
- Generative Modeling of Shape-Dependent Self-Contact Human Poses
-
This work constructs Goliath-SC, the first large-scale self-contact pose dataset with accurate shape annotations (383K poses / 130 subjects), proposes PAPoseDiff—a shape-conditioned part-aware latent diffusion model for modeling body-shape-dependent self-contact pose distributions—and leverages the learned diffusion prior for monocular pose refinement, outperforming SOTA methods such as BUDDI and SMPLer-X on unseen subjects.
- GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
-
This paper proposes GenFlowRL, which integrates generative object-centric optical flow with reinforcement learning by shaping rewards using a δ-flow representation extracted from a flow generation model trained on cross-embodiment datasets. The approach enables robust and generalizable robot manipulation policy learning, significantly outperforming flow-based imitation learning and video-guided RL methods across 10 manipulation tasks.
- GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
-
This work identifies that "perfect image reconstruction does not always yield the best visual representations," and proposes GenHancer — a two-stage post-training method that uses only a lightweight randomly initialized denoiser (~1/10 the parameters of pretrained heavy denoisers) conditioned solely on the global [CLS] token. Through self-supervised reconstruction, GenHancer enhances CLIP's fine-grained visual perception, achieving a 6.0% improvement over DIVA on MMVP-VLM.
- Golden Noise for Diffusion Models: A Learning Framework
-
This paper introduces the concept of "Noise Prompt" and proposes a lightweight Noise Prompt Network (NPNet). By collecting 100K noise pairs via Re-denoise Sampling, NPNet is trained to transform random Gaussian noise into semantically informed "golden noise," serving as a plug-and-play module to improve the generation quality of SDXL and other diffusion models with only a 3% increase in inference time.
- Grouped Speculative Decoding for Autoregressive Image Generation
-
This paper proposes Grouped Speculative Decoding (GSD), a training-free acceleration method for autoregressive image generation that performs speculative verification at the level of semantically valid token clusters rather than the single most probable token, achieving an average speedup of 3.7× without degrading image quality.
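One plausible reading of cluster-level verification, sketched as acceptance against the target model's high-probability token group (the paper's actual criterion for "semantically valid" clusters may differ):

```python
import torch

def grouped_accept(draft_token, target_logits, tau=0.9):
    """Accept a drafted image token if it falls inside the target model's
    smallest token group covering probability mass tau (schematic)."""
    probs = target_logits.softmax(-1)
    p_sorted, idx = probs.sort(descending=True)
    k = int((p_sorted.cumsum(0) < tau).sum().item()) + 1
    return draft_token in idx[:k].tolist()   # cluster-level, not argmax-level
```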
- Guiding Noisy Label Conditional Diffusion Models with Score-based Discriminator Correction
-
This paper proposes Score-based Discriminator Correction (SBDC), which trains a lightweight discriminator to correct the generation trajectory of noisy-label conditional diffusion models at inference time. The discriminator is trained by partitioning the training set into clean and corrupted subsets via noise detection, and the paper finds that applying guidance only during the early-to-middle stages of the sampling process yields optimal results.
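A sketch of the correction step, written in the form of standard discriminator guidance (SBDC's exact parameterization may differ; `sigma_t` and the stopping threshold are assumptions):

```python
import torch

def corrected_eps(eps_model, disc, x_t, t, sigma_t, w=1.0, t_stop=0.3):
    """Discriminator-corrected noise prediction (schematic).

    disc(x_t, t) returns the logit log(D / (1 - D)) of a lightweight
    classifier trained to separate clean-label from corrupted-label samples.
    Guidance is applied only while t > t_stop, i.e. during the early-to-middle
    stages of sampling, as the paper finds optimal.
    """
    eps = eps_model(x_t, t)
    if t <= t_stop:
        return eps
    x_in = x_t.detach().requires_grad_(True)
    logit = disc(x_in, t)
    grad = torch.autograd.grad(logit.sum(), x_in)[0]  # ∇_x log(D / (1 - D))
    # score = -eps / sigma, so adding w * grad to the score shifts eps by -w * sigma * grad
    return eps - w * sigma_t * grad
```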
- Holistic Tokenizer for Autoregressive Image Generation
-
This paper proposes Hita, a holistic-to-local image tokenizer that captures global attributes such as texture, material, and shape via learnable global queries, and integrates dual codebooks with a causal-attention fusion module. Without modifying the AR model architecture, Hita reduces ImageNet 256×256 generation FID to 2.59, accelerates training convergence by 2.1×, and supports zero-shot style transfer and image completion.
- Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning
-
HUB introduces the first comprehensive benchmark for evaluating concept unlearning methods in text-to-image diffusion models, covering 33 target concepts across 6 evaluation dimensions (faithfulness, alignment, pinpoint-ness, multilingual robustness, adversarial robustness, and efficiency), with 16,000 prompts per concept. The benchmark reveals that no single method achieves superiority across all dimensions.
- HPSv3: Towards Wide-Spectrum Human Preference Score
-
HPSv3 constructs the first wide-spectrum human preference dataset HPDv3 (1.08M image-text pairs, 1.17M annotated pairs), trains a preference model using a VLM backbone (Qwen2-VL) with an uncertainty-aware ranking loss, and proposes a Chain-of-Human-Preference (CoHP) iterative generation method, significantly improving the accuracy and coverage of image generation evaluation.
- HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation
-
This work combines the hierarchical representation learning capacity of hyperbolic space with the high-quality generative capability of diffusion autoencoders. By manipulating the radius and direction of latent codes within the Poincaré disk, it achieves controllable, diverse, and class-consistent few-shot image generation.
- ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
-
This paper proposes ILLUME, a unified MLLM that integrates multimodal understanding and generation capabilities into a single LLM via a unified next-token prediction paradigm. Through a semantic visual tokenizer (reducing pretraining data by 4× to 15M) and a self-enhancement multimodal alignment scheme (enabling the model to self-evaluate the consistency between its generated images and text), ILLUME achieves competitive or superior performance compared to state-of-the-art unified models across diverse understanding, generation, and editing tasks.
- ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
-
This paper introduces ImageGem, the first large-scale in-the-wild generative interaction dataset (57K users, 242K customized LoRAs, 3M text prompts, 5M generated images). Its per-user preference annotations enable three applications: aggregate preference alignment that surpasses Pick-a-Pic, personalized retrieval and generative recommendation (with significant gains from VLM reranking), and the first formulation of generative model personalization, which learns preference editing directions in the LoRA weight space (W2W) to customize diffusion models.
- Improved Noise Schedule for Diffusion Training
-
This paper proposes a unified framework for analyzing and designing noise schedules in diffusion models from a probability distribution perspective. It finds that a Laplace noise schedule—which concentrates sampling probability near \(\log\text{SNR}=0\) (the signal–noise transition point)—improves FID by 26.6% over the standard cosine schedule under the same training budget, outperforming all loss-weighting adjustment methods.
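Concretely, sampling \(\log\text{SNR}\) from a Laplace distribution is one inverse-CDF draw; a sketch with assumed hyperparameters (the paper's location/scale values may differ):

```python
import torch

def laplace_logsnr_schedule(batch, mu=0.0, b=1.0):
    """Draw logSNR ~ Laplace(mu, b), concentrating training near logSNR = 0."""
    u = torch.rand(batch).clamp(1e-6, 1 - 1e-6)
    # Laplace inverse CDF: mu - b * sign(u - 1/2) * ln(1 - 2|u - 1/2|)
    lam = mu - b * torch.sign(u - 0.5) * torch.log1p(-2 * (u - 0.5).abs())
    alpha = torch.sigmoid(lam).sqrt()    # signal scale: alpha^2 = sigmoid(logSNR)
    sigma = torch.sigmoid(-lam).sqrt()   # noise scale:  sigma^2 = sigmoid(-logSNR)
    return lam, alpha, sigma
```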
- Inference-Time Diffusion Model Distillation
-
This paper proposes Distillation++, an inference-time diffusion distillation framework that leverages a pretrained teacher model during the student model's sampling process to correct its denoising trajectory, significantly narrowing the teacher–student performance gap without requiring additional training data or fine-tuning.
- InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
-
This paper proposes InfGen, a "second-generation" paradigm that replaces the VAE decoder with a Transformer-based generator, decoding fixed-size latents into images at arbitrary resolution in a single forward pass—without modifying or retraining the diffusion model. It reduces 4K image generation to under 10 seconds, achieving over 10× speedup compared to the fastest existing method, UltraPixel.
- InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation
-
InfiniDreamer leverages a pretrained short-sequence motion diffusion model as a prior and proposes Segment Score Distillation (SSD), an optimization method that iteratively refines overlapping short segments within a coarsely initialized long motion sequence, enabling arbitrarily long human motion generation without requiring additional long-sequence training data.
- Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping
-
This paper proposes Inpaint4Drag, which decomposes drag-based image editing into two stages: pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation, the proposed bidirectional warping algorithm enables real-time preview (0.01s) and efficient generation (0.3s), achieving a 600× speedup over existing methods while serving as a universal adapter for arbitrary inpainting models.
- IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features
-
This paper proposes IntroStyle, a training-free style attribution method that leverages only channel-wise mean and variance statistics from intermediate layers of a diffusion model's own denoising network, measuring inter-image style similarity via the 2-Wasserstein distance. IntroStyle substantially outperforms supervised state-of-the-art methods on WikiArt and DomainNet without any task-specific training.
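Because the statistic is just per-channel Gaussian summaries, the distance is a few lines; a sketch assuming (C, H, W) features from a chosen denoising-network layer:

```python
import torch

def style_distance(feat_a, feat_b):
    """2-Wasserstein distance between per-channel Gaussian feature summaries.

    Treating each channel as a 1-D Gaussian N(mu_c, sigma_c^2), the squared
    2-Wasserstein distance per channel is (mu_a - mu_b)^2 + (sigma_a - sigma_b)^2;
    summing over channels and taking the root gives the style distance.
    """
    mu_a, sigma_a = feat_a.mean(dim=(1, 2)), feat_a.std(dim=(1, 2))
    mu_b, sigma_b = feat_b.mean(dim=(1, 2)), feat_b.std(dim=(1, 2))
    w2_sq = (mu_a - mu_b).pow(2) + (sigma_a - sigma_b).pow(2)
    return w2_sq.sum().sqrt()
```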
- Invisible Watermarks, Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design
-
This paper proposes Water4MU, a framework that integrates digital watermarking with machine unlearning (MU) via bi-level optimization (BLO). The upper level optimizes the watermark network to facilitate unlearning, while the lower level performs the unlearning optimization, thereby substantially improving unlearning effectiveness without significantly compromising model utility.
- IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
-
This paper proposes IRGPT, the first multimodal large language model grounded in real-world infrared images. It introduces IR-TD, a large-scale infrared-text dataset containing 260K+ image-text pairs, and designs a Bi-cross-modal Curriculum transfer learning strategy. IRGPT achieves state-of-the-art performance across 9 infrared task benchmarks, with its zero-shot performance sum exceeding the baseline InternVL2-8B by 76.35 points.
- Joint Diffusion Models in Continual Learning
-
This paper proposes JDCL, which unifies a classifier and a diffusion generative model into a single jointly parameterized network. Combined with knowledge distillation and a two-stage training strategy, JDCL substantially alleviates catastrophic forgetting in generative replay-based continual learning, surpassing existing generative replay methods.
- LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering
-
This paper proposes LaRender, a training-free image generation method grounded in volume rendering principles. It precisely controls inter-object occlusion relationships by "rendering" object features in latent space. The method replaces only the cross-attention layers of a pretrained diffusion model without introducing any learnable parameters, significantly outperforming existing SOTA methods in occlusion accuracy while enabling rich effects such as semantic transparency control.
- Latent Diffusion Models with Masked AutoEncoders
-
This paper systematically analyzes three key properties that autoencoders in LDMs should possess (latent space smoothness, perceptual compression quality, and reconstruction quality), identifies that existing autoencoders fail to satisfy all three simultaneously, and proposes Variational Masked AutoEncoders (VMAEs). By combining MAE's hierarchical features with VAE's probabilistic encoding, VMAEs achieve significant improvements in generation quality (ImageNet-1K gFID: 5.98 vs. 6.49 for SD-VAE) using only 13.4% of the parameters and 4.1% of the GFLOPs.
- LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization
-
LATINO-PRO is the first work to embed Latent Consistency Models (LCMs) as generative priors within a zero-shot inverse problem solving framework, achieving state-of-the-art reconstruction quality with only 8 neural function evaluations (NFEs), and further improving performance via empirical Bayes-based automatic text prompt calibration.
- Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
-
This paper proposes LayouSyn, an open-vocabulary text-to-layout generation pipeline that extracts scene elements via a lightweight open-source language model and generates layouts using an aspect-ratio-aware diffusion Transformer, achieving state-of-the-art performance on spatial and quantity reasoning benchmarks.
- LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching
-
LazyMAR addresses the inference efficiency bottleneck of Masked Autoregressive (MAR) models by exploiting two types of redundancy: token redundancy (most token features are highly similar across adjacent decoding steps) and condition redundancy (the residual between conditional and unconditional outputs in classifier-free guidance changes minimally between adjacent steps). Based on these observations, the paper proposes token cache and condition cache mechanisms, achieving a 2.83× speedup with negligible loss in generation quality.
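A minimal sketch of the condition-cache half of this idea (the token cache and the exact refresh schedule are omitted; the function names and the occasional-refresh policy are assumptions):

```python
import torch

def guided_output(model, x, t, cond, cache, w=3.0, refresh=False):
    """CFG evaluation with a cached cond-uncond residual: the unconditional
    pass is skipped on most steps because the residual drifts slowly."""
    out_c = model(x, t, cond)                     # conditional pass, always run
    if refresh or cache.get("residual") is None:
        out_u = model(x, t, None)                 # unconditional pass, run rarely
        cache["residual"] = out_c - out_u
    # CFG identity: out_u + w*(out_c - out_u) == out_c + (w - 1)*(out_c - out_u)
    return out_c + (w - 1.0) * cache["residual"]
```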
- LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
-
LD-RPS proposes a zero-shot, dataset-free unified image restoration method that performs recurrent posterior sampling via a pretrained latent diffusion model. It leverages multimodal large language models for semantic priors and a learnable F-PAM module to align the degradation domain, achieving high-quality blind restoration across diverse degradation types.
- Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model
-
TP-Diff is the first work to introduce diffusion models into unpaired image deblurring. It learns spatially varying texture priors via a memory-augmented Texture Prior Encoder (TPE), and designs a Filter-Modulated Multi-head Self-Attention (FM-MSA) to leverage these priors for precise deblurring, achieving a new unsupervised state-of-the-art on multiple benchmarks with only 11.89M parameters.
- Learning Few-Step Diffusion Models by Trajectory Distribution Matching
-
This paper proposes Trajectory Distribution Matching (TDM), a novel paradigm that unifies trajectory distillation and distribution matching by aligning the marginal distributions of student and teacher ODE trajectories at the distributional level. TDM enables efficient few-step diffusion model distillation, requiring only 2 A800 GPU-hours to distill PixArt-α into a 4-step generator that surpasses the teacher model.
- Learning to See in the Extremely Dark
-
This paper proposes a paired-to-paired data synthesis pipeline to construct SIED, a RAW image enhancement dataset for extremely dark scenes (down to 0.0001 lux), and designs a diffusion-model-based framework that achieves high-quality restoration of ultra-low-SNR RAW images via an Adaptive Illumination Correction Module (AICM) and a color consistency loss.
- Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
-
This paper proposes UNO, a universal DiT-based customized generation model. Through a "model-data co-evolution" paradigm—wherein synthetic data generated by weaker models progressively trains stronger models—combined with progressive cross-modal alignment and Universal RoPE, UNO achieves state-of-the-art performance on both single- and multi-subject-driven image generation (DreamBench DINO 0.760, CLIP-I 0.835).
- Less is More: Improving Motion Diffusion Models with Sparse Keyframes
-
This paper proposes sMDM, a motion diffusion framework centered on sparse keyframes. By introducing a masking-interpolation strategy and the Visvalingam-Whyatt keyframe selection algorithm, sMDM reduces redundant frame processing and consistently outperforms dense-frame baselines in text alignment and motion quality.
- LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding
-
LIFT proposes a meta-learning-based multi-scale implicit neural representation framework that achieves unified encoding across tasks (generation, classification) and data modalities (2D images, 3D voxels) via parallel local implicit functions and a hierarchical latent generator, attaining state-of-the-art performance on both reconstruction and generation tasks while substantially reducing computational cost.
- LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation
-
This paper systematically investigates how to safely and efficiently convert a pretrained DiT into a linear attention variant called LiT. It proposes five practical guidelines—depth-wise convolution augmentation, fewer heads, weight inheritance, selective parameter loading, and hybrid distillation—achieving comparable performance with only 20% of DiT's training steps.
- Long-Context State-Space Video World Models
-
This paper proposes integrating State Space Models (SSM/Mamba) into video world models. Through a block-wise SSM scan scheme that balances spatial consistency and temporal memory, combined with local frame attention, the method achieves persistent long-term spatial memory under linear training complexity and constant inference overhead, substantially outperforming finite-context Transformer baselines on Memory Maze and Minecraft datasets.
- Looking in the Mirror: A Faithful Counterfactual Explanation Method for Interpreting Deep Image Classification Models
-
This paper treats a classifier's decision boundary as a "mirror" and generates counterfactual explanations (CFEs) by reflecting feature representations to the other side of the mirror. A triangulation loss is designed to preserve distance relationships between the latent space and image space, yielding faithful, controllable, and animatable counterfactual explanations.
- LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models
-
This paper formulates the retrieval of relevant and diverse LoRA combinations from a library of 100K+ adapters as a combinatorial optimization problem. It proposes LoRAverse, a framework based on submodular function maximization, which achieves relevance- and diversity-aware LoRA selection through concept extraction followed by submodular retrieval.
- LUSD: Localized Update Score Distillation for Text-Guided Image Editing
-
LUSD addresses the failure modes of existing score distillation methods in image editing (particularly object insertion) through two simple modifications—attention-based spatial regularization and gradient filtering-normalization—which resolve instabilities caused by large disparities in gradient magnitude and spatial distribution, achieving a better balance between prompt fidelity and background preservation.
- M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization
-
This paper proposes M2SFormer, which unifies multi-spectral (2D DCT frequency-domain) and multi-scale (SIFT-style spatial pyramid) attention mechanisms within encoder-decoder skip connections, and introduces an edge-aware curvature-based difficulty-guided attention decoder. The method achieves state-of-the-art cross-domain generalization in image forgery localization (average unseen-domain DSC 43.0% and mIoU 34.3% under the CASIAv2 training protocol).
- Make Me Happier: Evoking Emotions Through Image Diffusion Models
-
EmoEditor presents the first systematic emotion-driven image generation framework, employing a dual-branch diffusion model (global emotion conditioning + local semantic features) to generate target-emotion images from only a source image and a target emotion label — without manual text instructions or reference images. The work also introduces the EmoPair dataset containing 340K emotion-annotated image pairs.
- MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence
-
This paper proposes MamTiff-CAD, a framework that combines a Mamba+-based encoder with a Transformer decoder in an autoencoder to learn latent representations of CAD command sequences, followed by a multi-scale Transformer diffusion model for generation. It is the first method to generate complex CAD models with sequence lengths of 60–256 commands.
- MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
-
MaskControl is the first work to introduce spatial controllability into generative masked motion models. It manipulates the logits of the token classifier via two core components — Logits Regularizer (implicit alignment during training) and Logits Optimization (explicit optimization during inference) — simultaneously achieving high-quality motion generation (FID reduced by 77%) and high-precision joint control (average error 0.91 cm vs. 1.08 cm).
- MatchDiffusion: Training-free Generation of Match-Cuts
-
MatchDiffusion is proposed as a training-free two-stage pipeline that exploits the property of diffusion models—whereby early denoising steps establish the macroscopic scene structure and late steps add semantic details—to automatically generate match-cut video pairs via Joint Diffusion and Disjoint Diffusion.
- MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
-
This paper proposes MAVFlow, a zero-shot audio-visual renderer based on conditional flow matching (CFM), which leverages dual-modal guidance from audio speaker embeddings and visual emotion embeddings to preserve speaker consistency in multilingual AV2AV translation.
- Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts
-
This paper proposes a Meta-Unlearning framework for diffusion models that augments standard unlearning objectives with a meta-objective, causing benign knowledge associated with unlearned concepts to self-destruct upon malicious fine-tuning, thereby preventing relearning of erased concepts. The framework is compatible with most existing unlearning methods and requires only the addition of a simple meta-objective.
- Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching
-
This paper identifies an "alignment gap" in vision foundation models (e.g., DINOv2) for image feature matching: contrastive learning-based models discard instance-level details and lack cross-image interaction mechanisms, causing failures in multi-instance matching scenarios. To address this, the authors propose the IMD framework, which employs diffusion models as feature extractors to preserve instance-level details, and designs a Cross-Image Interaction Prompt Module (CIPM) for bidirectional information exchange. IMD achieves state-of-the-art performance on standard benchmarks and on the newly introduced multi-instance benchmark IMIM, with a 12% improvement in multi-instance scenarios.
- MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance
-
MMAIF proposes a unified multi-task, multi-degradation, language-guided image fusion framework that operates in latent space via a realistic degradation pipeline and a modernized DiT architecture. It offers both a regression and a Flow Matching variant, surpassing existing restoration+fusion pipelines across diverse degraded fusion tasks.
- MoFRR: Mixture of Diffusion Models for Face Retouching Restoration
-
This paper introduces the Face Retouching Restoration (FRR) task for the first time and proposes the MoFRR framework—inspired by DeepSeek MoE—which activates retouching-type-specific experts (Wavelet DDIM) and a shared expert (general DDIM) via a router, achieving near-authentic restoration of retouched faces on the newly constructed million-scale RetouchingFFHQ++ dataset.
- MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics
-
This paper proposes MosaicDiff, a training-free structural pruning method for diffusion models that dynamically partitions the inference trajectory into three stages based on pretraining learning-speed dynamics and applies stage-specific subnetworks with varying sparsity, achieving significant acceleration on DiT and SDXL without sacrificing generation quality.
- MotionDiff: Training-Free Zero-Shot Interactive Motion Editing via Flow-Assisted Multi-View Diffusion
-
MotionDiff proposes a training-free, zero-shot multi-view motion editing approach that estimates multi-view optical flow from static scenes via a Point Kinematics Model (PKM), and leverages a decoupled motion representation to guide Stable Diffusion in generating high-quality, multi-view-consistent motion editing results.
- MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
-
This paper proposes MotionStreamer, which integrates a continuous causal latent space with a diffusion head into an autoregressive framework for text-conditioned streaming human motion generation, supporting online multi-turn generation and dynamic motion composition.
- Multi-turn Consistent Image Editing
-
This paper proposes a multi-turn image editing framework based on flow matching. By incorporating dual-objective LQR guidance and an adaptive attention mechanism, it effectively suppresses error accumulation across editing rounds, enabling flexible and controllable iterative editing while maintaining content consistency.
- Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation
-
This paper proposes SewingLDM, a multimodal conditional latent diffusion model that generates complex sewing patterns under text, sketch, and body-shape conditions via an extended sewing pattern representation and a two-stage training strategy, with seamless integration into CG simulation pipelines.
- MUNBa: Machine Unlearning via Nash Bargaining
-
This work formulates Machine Unlearning (MU) as a two-player cooperative bargaining game and derives a closed-form solution via Nash bargaining theory to simultaneously address gradient conflict and gradient dominance between the forgetting and retention objectives, achieving an optimal balance between unlearning and preservation across both classification and generation tasks.
- Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling
-
This paper introduces the SoulDance dataset (the first high-quality 3D dance dataset encompassing body, hand, and facial motion) and the SoulNet framework (hierarchical residual vector quantization + music-aligned generative model + cross-modal retrieval), achieving the first whole-body 3D dance generation with coordinated facial expressions, body, and hand movements aligned to musical rhythm and emotion.
- NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
-
NuiScene proposes an efficient vector set encoding scheme for scene chunks, paired with an explicitly trained outpainting diffusion model, to enable fast unbounded outdoor scene generation. The work also curates NuiScene43, a high-quality outdoor scene dataset.
- NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping
-
This paper proposes NullSwap, which embeds identity-guided invisible perturbations into source images to cloak facial identity information, preventing Deepfake face-swapping models from extracting the correct identity, thereby enabling proactive defense against face-swapping attacks in a purely black-box setting.
- Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
-
Omegance proposes scaling the noise prediction in each denoising step of a diffusion model by a single parameter \(\omega\), enabling training-free global, spatial, and temporal control over the detail granularity of generated images and videos. The method is architecture-agnostic and compatible with SDXL, SD3, FLUX, and other models.
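The mechanism amounts to a one-line change in the sampling loop. A sketch assuming a diffusers-style scheduler; how ω maps to coarser versus finer detail, and any spatial or temporal masking of ω, are not reproduced here.

```python
def omega_step(scheduler, model, x, t, cond, omega=1.0):
    """One denoising step with the noise prediction rescaled by a single
    scalar omega; omega = 1.0 recovers the ordinary update."""
    eps = model(x, t, cond)                           # predicted noise
    return scheduler.step(omega * eps, t, x).prev_sample
```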
- OminiControl: Minimal and Universal Control for Diffusion Transformer
-
OminiControl is proposed to unify spatially aligned and non-aligned image control tasks on the DiT architecture with only 0.1% additional parameters. Core innovations include unified sequence processing, dynamic positional encoding, and an attention bias control mechanism.
- OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
-
This paper proposes OmniPaint, a unified framework that reformulates object removal and insertion as mutually inverse and complementary tasks. Built upon the FLUX diffusion prior, it introduces the CycleFlow unpaired training mechanism and the CFD reference-free evaluation metric. With only 3K real paired samples, OmniPaint achieves high-fidelity object editing, excelling particularly at complex physical effects such as shadows and reflections.
- OmniVTON: Training-Free Universal Virtual Try-On
-
OmniVTON proposes the first training-free universal virtual try-on framework. By decoupling garment texture and pose conditions, the method employs three core modules—Structured Garment Morphing (SGM), Continuous Boundary Stitching (CBS), and Spectral Pose Injection (SPI)—to achieve high-fidelity try-on in both in-shop and in-the-wild settings, while also supporting multi-person try-on for the first time.
- Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
-
This paper presents Ouroboros, a unified framework comprising two single-step diffusion models (for inverse rendering RGB→X and forward rendering X→RGB respectively) that are jointly trained with cycle-consistency to enforce bidirectional rendering coherence. The method achieves state-of-the-art performance across multiple datasets while running 50× faster than multi-step diffusion baselines, and can be applied to video decomposition in a training-free manner.
- PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
-
This paper proposes PanoLlama, which extends fixed-size visual autoregressive (VAR) models to endless panorama generation via a token redirection strategy, enabling training-free next-crop prediction that surpasses joint diffusion methods in coherence, fidelity, and aesthetics.
- PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution
-
This paper proposes PatchScaler, a patch-level independent diffusion super-resolution pipeline that employs a Global Restoration Module to generate confidence maps quantifying per-region reconstruction difficulty, partitions patches into easy/medium/hard groups with different sampling step budgets, and incorporates a texture prompt retrieval mechanism — achieving superior quality on RealSR at only 0.23× the inference time of ResShift.
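The routing step the summary describes might look like the following sketch; the thresholds and per-group step budgets are illustrative placeholders, not the paper's values.

```python
import torch

def step_budgets(confidence: torch.Tensor, thresholds=(0.8, 0.5), budgets=(1, 4, 15)):
    """Map per-patch reconstruction confidence (N,), higher = easier,
    to a diffusion sampling-step budget per patch."""
    steps = torch.full_like(confidence, float(budgets[2]))   # hard patches
    steps[confidence >= thresholds[1]] = budgets[1]          # medium patches
    steps[confidence >= thresholds[0]] = budgets[0]          # easy patches
    return steps.long()
```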
- Penalizing Boundary Activation for Object Completeness in Diffusion Models
-
This paper investigates the root cause of incomplete object generation in diffusion models — the RandomCrop data augmentation used during training — and proposes a training-free boundary activation penalty method. By applying cross-attention and self-attention constraints during early denoising steps, the method suppresses object generation near image boundaries, reducing the object incompleteness rate of SDv2.1 from 45.7% to 17.3%.
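A sketch of the kind of penalty this implies for one object token's cross-attention map (the band width, and how the penalty is injected as guidance during early steps, are assumptions):

```python
import torch

def boundary_penalty(attn: torch.Tensor, band: int = 4) -> torch.Tensor:
    """Attention mass an object token places inside a border band of the
    (H, W) latent grid; suppressing it during early denoising steps keeps
    objects away from image boundaries."""
    mask = torch.zeros_like(attn, dtype=torch.bool)
    mask[:band, :] = True
    mask[-band:, :] = True
    mask[:, :band] = True
    mask[:, -band:] = True
    return attn[mask].sum()
```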
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation
-
This paper proposes PersonalVideo, a framework that applies hybrid reward supervision—comprising an Identity Consistency Reward (ICR) and a Semantic Consistency Reward (SCR)—directly to generated videos. This approach eliminates the distribution gap between T2I fine-tuning and T2V inference inherent in conventional methods, achieving high identity fidelity while preventing degradation of motion dynamics and semantic alignment.
- PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups
-
This paper proposes Person-Interaction Noise Optimization (PINO), a training-free framework that decomposes complex multi-person group interactions into semantically well-defined dyadic interaction pairs. By leveraging a pretrained two-person interaction diffusion model with noise optimization and physical penalty terms, PINO sequentially synthesizes group interaction motions of arbitrary scale, supporting fine-grained user control and long-duration motion generation.
- PLA: Prompt Learning Attack against Text-to-Image Generative Models
-
This paper proposes PLA (Prompt Learning Attack), a gradient-driven adversarial attack framework targeting black-box T2I models. By leveraging sensitive knowledge encoding and multimodal similarity losses, PLA learns adversarial prompts that bypass both prompt filters and post-hoc safety checkers, achieving an average ASR-4 exceeding 90%, substantially outperforming existing methods.
- Pretrained Reversible Generation as Unsupervised Visual Representation Learning
-
PRG extracts unsupervised visual representations by inverting the generation process of pretrained continuous generative models (diffusion/flow models), enabling model-agnostic adaptation to discriminative tasks. It achieves 78% top-1 accuracy on ImageNet 64×64, setting the state of the art among generative-model-based methods.
- Randomized Autoregressive Visual Generation
-
This paper proposes Randomized AutoRegressive modeling (RAR): during standard autoregressive training, the input sequence is randomly permuted and gradually annealed back to raster-scan order, enabling the model to learn bidirectional context. RAR achieves a state-of-the-art FID of 1.48 on ImageNet-256 for autoregressive image generation while remaining fully compatible with the language model framework.
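The schedule reduces to a few lines; the linear anneal shape and its end point below are guesses, since the summary only fixes "randomly permuted, annealed back to raster-scan order".

```python
import random

def factorization_order(num_tokens: int, step: int, total_steps: int,
                        anneal_frac: float = 0.7):
    """With probability p (annealed 1 -> 0), train this batch on a randomly
    permuted token order; afterwards, pure raster-scan order."""
    p = max(0.0, 1.0 - step / (anneal_frac * total_steps))
    order = list(range(num_tokens))
    if random.random() < p:
        random.shuffle(order)   # random order -> model sees bidirectional context
    return order
```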
- REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents
-
This paper proposes Reducio-VAE, a content-frame-conditioned 3D video autoencoder that compresses video into a motion latent space 64× smaller than a standard 2D VAE. Paired with Reducio-DiT, it generates 16-frame 1024×1024 videos in 15.5 seconds on a single A100 GPU, with training requiring only 3,200 A100 GPU hours.
- ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
-
To address the challenge of real image editing in Rectified Flow (ReFlow) models, this paper systematically analyzes intermediate representations in MM-DiT, identifies three key features (I2I-SA, I2T-CA, and residual features), and proposes mid-step feature extraction along with two attention adaptation techniques. The resulting training-free, user-mask-free method achieves high-quality real image editing on the FLUX model, attaining a 68.2% human preference rate that substantially outperforms competing approaches.
- REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder
-
REGEN replaces the conventional VAE decoder with a Diffusion Transformer (DiT) as a re-generative decoder for video, breaking the temporal compression bottleneck through a "generation rather than exact reconstruction" learning paradigm and achieving up to 32× temporal compression.
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
-
This paper proposes REPA-E, which enables joint end-to-end training of VAE and latent diffusion Transformers via representation alignment (REPA) loss, achieving 17× and 45× training speedups over REPA and vanilla training respectively, and setting a new state of the art of FID 1.12 on ImageNet 256×256.
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
-
This paper proposes REPA-E, the first training framework to successfully enable end-to-end joint tuning of a VAE and a latent diffusion model. By updating the VAE via the REPA alignment loss rather than the diffusion loss, REPA-E achieves a 17–45× training speedup and sets a new state of the art on ImageNet 256 (FID 1.12).
- Rethink Sparse Signals for Pose-guided Text-to-Image Generation
-
This paper proposes SP-Ctrl (Spatial-Pose ControlNet), which replaces the fixed RGB encoding of OpenPose with learnable Spatial-Pose Representations (SPR) and introduces a Keypoint Concept Learning (KCL) strategy that leverages cross-attention heatmap constraints to improve keypoint alignment. The method enables sparse pose signals to achieve pose control accuracy comparable to dense signals (depth maps / DensePose), while preserving image diversity and cross-species generation capability.
- Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
-
This paper identifies two structural issues in MM-DiT architectures (FLUX, SD3.5): the token count asymmetry between visual and text modalities suppresses cross-modal attention, and attention weights are insensitive to timestep. To address these, the authors propose TACA (Temperature-Adjusted Cross-modal Attention), which rebalances multimodal interaction via temperature scaling and timestep-adaptive adjustment. Combined with LoRA fine-tuning, TACA achieves significant improvements in text-image alignment on T2I-CompBench (spatial relations +16.4%, shape +5.9%) with negligible additional computational overhead.
- Rethinking Layered Graphic Design Generation with a Top-Down Approach
-
This paper proposes Accordion, a top-down framework that converts AI-generated rasterized design images into editable layered designs (comprising background, foreground object, and vectorized text layers), where a VLM plays distinct roles across three stages: reference creation, design planning, and layer generation.
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
-
This paper proposes VLN-PE, the first physically realistic vision-and-language navigation platform supporting humanoid, quadruped, and wheeled robots. It systematically evaluates existing VLN methods under real physical constraints, revealing a 34% drop in success rate when transferring from simulation to physical deployment.
- Revelio: Interpreting and Leveraging Semantic Information in Diffusion Models
-
Revelio employs k-sparse autoencoders (k-SAE) to uncover monosemantic, interpretable features encoded across different layers and timesteps of diffusion models, and validates the transfer learning utility of these features via a lightweight classifier, Diff-C, enabling a systematic interpretation of black-box diffusion models.
- SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer
-
This paper proposes SA-LUT, which achieves spatially adaptive photorealistic style transfer via a style-guided 4D look-up table and a context map generated by content-style cross-attention. On the newly introduced PST50 benchmark, SA-LUT reduces LPIPS by 66.7% compared to 3D LUT methods while supporting real-time 4K video processing at 16 FPS.
- SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
-
SANA-Sprint proposes a hybrid distillation framework combining continuous-time consistency models (sCM) and latent adversarial diffusion distillation (LADD). It losslessly converts pretrained Flow Matching models to TrigFlow and jointly trains with sCM+LADD, yielding a single unified model that generates high-quality images adaptively in 1–4 steps, with a single-step latency of only 0.1 seconds on H100.
- SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
-
This work converts a pretrained SANA flow matching model into TrigFlow via a lossless mathematical transformation, and combines continuous-time consistency distillation (sCM) with latent adversarial diffusion distillation (LADD) in a hybrid training strategy, yielding a single model that generates high-quality images adaptively in 1–4 steps. One-step generation of 1024×1024 images requires only 0.1s on an H100, surpassing FLUX-schnell with an FID of 7.59 and a GenEval score of 0.74 while being 10× faster.
- SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models
-
SCFlow is proposed to learn an invertible merging mapping between style and content via Flow Matching, leveraging the invertibility of the mapping to allow disentanglement to emerge naturally as an implicit property of the merging process, without requiring explicit disentanglement supervision.
- ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion
-
ScoreHOI employs a score-based diffusion model as an optimizer, integrating DDIM inversion–forward sampling with physical constraints (contact, penetration, ground contact) to guide the denoising process. Combined with a contact-driven iterative refinement strategy, it achieves physically plausible 3D reconstruction of human-object interactions from monocular images, improving contact F-Score by 9% on BEHAVE.
- SDMatte: Grafting Diffusion Models for Interactive Matting
-
This paper proposes SDMatte, a Stable Diffusion-based interactive matting model that converts the text interaction capability of diffusion models into visual prompt interaction capability via three key designs: visual prompt cross-attention, coordinate/opacity embeddings, and mask self-attention. SDMatte significantly outperforms SAM-based methods across multiple datasets.
- Semantic Discrepancy-aware Detector for Image Forgery Identification
-
This paper proposes the Semantic Discrepancy-aware Detector (SDD), which leverages three modules — Semantic Token Sampling (STS), Concept-level Forgery Discrepancy Learning (CFDL), and a Low-level Forgery Feature Enhancer — to align CLIP's visual semantic concept space with the forgery space via reconstruction learning. SDD achieves state-of-the-art performance on the UnivFD and SynRIS benchmarks (\(ap_m\) 98.51%, AUROC 95.1%).
- Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity
-
This paper addresses the frequency integrity loss caused by discarding the imaginary part in existing semantic watermarking methods for latent diffusion models (LDMs). It proposes Hermitian Symmetric Fourier Watermarking (SFW) and a center-aware embedding strategy to preserve frequency-domain integrity while enhancing detection robustness and generation quality.
- ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning
-
ShortFT is proposed to construct denoising shortcuts using trajectory-preserving few-step diffusion models, substantially compressing the original lengthy denoising chain to enable complete end-to-end reward gradient backpropagation, achieving efficient and effective alignment of diffusion models with reward functions.
- SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
-
SliderSpace applies PCA to CLIP features of images generated by a diffusion model under a given prompt, automatically discovering multiple semantically orthogonal controllable directions. Each direction is trained as a LoRA adapter (slider), enabling concept decomposition, artistic style exploration, and diversity enhancement without any manually specified attributes.
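The discovery step is essentially PCA; a sketch follows (the feature choice and k are assumptions, and the per-direction LoRA training it feeds is not shown):

```python
import numpy as np

def slider_directions(clip_feats: np.ndarray, k: int = 8) -> np.ndarray:
    """PCA over CLIP embeddings (N, D) of images sampled from one prompt;
    the top-k right singular vectors are the candidate slider directions."""
    centered = clip_feats - clip_feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]                      # (k, D), mutually orthogonal
```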
- SMGDiff: Soccer Motion Generation using Diffusion Probabilistic Models
-
This paper proposes SMGDiff, a two-stage diffusion model framework that generates high-quality, diverse soccer motion animations in real time from user control signals, while refining ball-foot interaction details via a contact guidance module.
- Spectral Image Tokenizer
-
This paper proposes the Spectral Image Tokenizer (SIT), which tokenizes images in the frequency domain after converting them via the Discrete Wavelet Transform (DWT). The resulting token sequence is naturally arranged in a coarse-to-fine order, enabling capabilities unavailable to conventional raster-scan tokenizers, including multi-resolution reconstruction, progressive generation, text-guided super-resolution, and image editing.
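The coarse-to-fine ordering can be illustrated with an off-the-shelf wavelet transform; the real tokenizer quantizes these subbands into discrete tokens rather than emitting raw coefficients.

```python
import numpy as np
import pywt

def coarse_to_fine_bands(img: np.ndarray, levels: int = 3):
    """DWT subbands of an (H, W) image, arranged lowest-frequency first;
    sides should be divisible by 2**levels."""
    coeffs = pywt.wavedec2(img, "haar", level=levels)
    bands = [coeffs[0].ravel()]               # coarse approximation first
    for lh, hl, hh in coeffs[1:]:             # detail bands, coarse -> fine
        bands += [lh.ravel(), hl.ravel(), hh.ravel()]
    return bands
```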
- Straighten Viscous Rectified Flow via Noise Optimization
-
This paper proposes VRFNO (Viscous Rectified Flow via Noise Optimization), which enhances trajectory distinguishability by introducing a historical velocity term and jointly trains an encoder to optimize noise for constructing optimal couplings, effectively straightening the inference trajectories of Rectified Flow. VRFNO achieves state-of-the-art one-step/few-step generation performance on CIFAR-10 and AFHQ (one-step FID of 4.50, without distillation).
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
-
StreamDiffusion proposes a pipeline-level real-time diffusion framework that achieves up to 91 fps on a single RTX 4090—59.6× faster than Diffusers AutoPipeline—through Stream Batch (batched denoising steps), R-CFG (residual classifier-free guidance), and SSF (stochastic similarity filtering).
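The Stream Batch component in isolation (a sketch; it assumes the denoiser accepts a per-sample timestep vector, as diffusers UNets do):

```python
import torch

def stream_batch_step(model, latents, timesteps, cond):
    """Advance several in-flight frames, each at a different denoising step,
    with one batched model call instead of one call per (frame, step).
    latents: list of (C, H, W) tensors; timesteps: one int per frame."""
    x = torch.stack(latents)                          # (T, C, H, W)
    t = torch.tensor(timesteps, device=x.device)      # per-row timestep
    return model(x, t, cond)                          # row i advances frame i
```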
- Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal
-
This paper formulates portrait shadow removal as a diffusion inpainting problem. It trains an illumination-invariant structure extraction network to obtain structure maps free of shadow boundaries, uses these maps to guide an inpainting diffusion model for shadow region restoration, and applies a gradient-guided detail recovery diffusion model to reconstruct fine facial details. The proposed method substantially outperforms existing approaches on benchmark datasets.
- StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance
-
This paper proposes Negative Visual Query Guidance (NVQG), a training-free method that suppresses content leakage by injecting the reference image's queries as a negative guidance signal in self-attention layers. The approach achieves high-quality visual style prompting and outperforms existing methods in both style similarity and text alignment.
- StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion
-
This paper proposes StyleMotif, a single-branch motion latent diffusion framework that unifies content generation and multi-modal style injection (text/image/video/audio/motion) via a style-content cross normalization mechanism. Compared to SMooDi's dual-branch design, StyleMotif reduces trainable parameters by 43.9% and improves inference speed by 22.5%, while achieving a 5.23% gain in Style Recognition Accuracy (SRA).
- SummDiff: Generative Modeling of Video Summarization with Diffusion
-
SummDiff is the first work to introduce diffusion models into video summarization, formulating the task as a conditional generation problem. By learning the distribution of "good summaries," the model generates diverse plausible summaries that better reflect the inherent subjectivity of the video summarization task.
- SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
-
SuperEdit addresses the noisy supervision problem in instruction-based image editing by leveraging diffusion generation priors to guide a VLM in rectifying editing instructions, and by constructing contrastive supervision signals (positive/negative instructions + triplet loss), surpassing SmartEdit by 9.19% with less data and a smaller model.
- Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection
-
This paper proposes SynOOD, which synthesizes challenging near-boundary OOD samples by combining MLLM-based contextual semantic extraction, iterative diffusion inpainting, and OOD gradient guidance. The synthesized samples are used to fine-tune the CLIP image encoder and negative label features, achieving a 2.80% AUROC improvement and 11.13% FPR95 reduction on the ImageNet benchmark.
- TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation
-
TaxaDiffusion leverages the hierarchical structure of biological taxonomy (Kingdom→Phylum→Class→Order→Family→Genus→Species) to progressively train a diffusion model, gradually refining from high-level shared characteristics to species-level subtle distinctions. This approach achieves high-precision fine-grained animal image generation, reducing FID to 31.87 on the FishNet dataset (vs. 43.91 for LoRA), improving the BioCLIP alignment score by 37%, and remaining effective for rare species with very few training samples (even as few as one).
- TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance
-
This paper proposes TeEFusion, which encodes the guidance magnitude of CFG directly as a linear combination of conditional and unconditional text embeddings to replace dual forward passes, achieving efficient CFG distillation with zero additional parameters. The method is compatible with complex teacher sampling strategies (e.g., Z-Sampling, W2SD), enabling a 6× inference speedup over the teacher model.
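The core trick fits in one line. A sketch using the canonical CFG interpolation; the paper's exact blend and the distillation loss around it may differ.

```python
import torch

def fused_embedding(e_cond: torch.Tensor, e_uncond: torch.Tensor, w: float):
    """Fold the guidance scale w into the text embedding itself, so the
    distilled student needs one forward pass instead of two."""
    return e_uncond + w * (e_cond - e_uncond)
```

The student is then trained so that conditioning on this blended embedding reproduces the teacher's CFG-guided output.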
- TeRA: Rethinking Text-guided Realistic 3D Avatar Generation
-
TeRA is proposed as the first text-guided 3D realistic avatar generation framework based on a latent diffusion model. By distilling a large-scale human reconstruction model to construct a structured latent space, TeRA generates realistic 3D human avatars in 12 seconds—two orders of magnitude faster than SDS-based methods.
- Text Embedding Knows How to Quantize Text-Guided Diffusion Models
-
This paper is the first to leverage text prompts to guide dynamic bit-width allocation for diffusion model quantization — by predicting the quality of images generated from a given text prompt, it adaptively selects high/medium/low bit precision for different layers and timesteps, reducing computational complexity while maintaining or even improving generation quality.
- The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation
-
This paper identifies the "curse of conditions" in conditional flow matching — a training-test mismatch caused by standard optimal transport (OT) ignoring conditioning information, which induces a conditionally skewed prior during training while an unbiased prior is used at test time. The authors propose C²OT (Conditional Optimal Transport), which resolves this issue by incorporating a condition-weighted term into the OT cost matrix.
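A rough sketch of a condition-weighted cost matrix in the spirit of this summary; the paper's actual weighting term differs in form, and treating each noise row as carrying the condition it will be trained with is my assumption.

```python
import numpy as np

def conditional_ot_cost(x0, x1, c0, c1, lam=10.0):
    """Pairwise cost between noise batch x0 (N, D) and data batch x1 (N, D);
    pairings whose condition features c0, c1 (N, K) disagree pay a penalty,
    discouraging the condition-skewed couplings plain OT produces."""
    sample_cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)  # (N, N)
    cond_cost = ((c0[:, None, :] - c1[None, :, :]) ** 2).sum(-1)    # (N, N)
    return sample_cost + lam * cond_cost
```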
- The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation
-
This paper proposes NoiseQuery, a training-free T2I generation enhancement method that pre-constructs a large-scale noise library and, at inference time, retrieves the initial noise best matching the user's goal. The approach enables fine-grained control over both high-level semantics and low-level visual attributes at only 0.002s of additional overhead per prompt, and improves performance across multiple T2I models and enhancement techniques.
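Retrieval itself is a nearest-neighbor lookup; the sketch below assumes each stored seed was indexed offline by a CLIP-like feature of what it tends to generate (the indexing feature is my assumption).

```python
import torch
import torch.nn.functional as F

def retrieve_noise(noise_bank, bank_feats, goal_feat):
    """Return the seed noise whose offline description best matches the goal.
    noise_bank: (N, C, H, W); bank_feats: (N, D); goal_feat: (D,)."""
    sims = F.cosine_similarity(bank_feats, goal_feat.unsqueeze(0), dim=-1)
    return noise_bank[sims.argmax()]
```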
- Timestep-Aware Diffusion Model for Extreme Image Rescaling
-
This paper proposes TADM, which performs extreme image rescaling (16×/32×) in the latent space of a pretrained Stable Diffusion model. By introducing a Decoupled Feature Rescaling Module (DFRM) and a timestep-aware alignment strategy, TADM dynamically allocates the generative capacity of the diffusion model to handle spatially non-uniform degradation.
- TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
-
This paper proposes TLB-VFI, an efficient video diffusion model for frame interpolation. It employs a temporal-aware autoencoder—comprising a latent-space temporal block and a pixel-space 3D wavelet gating mechanism—to extract rich temporal information, combined with a redesigned Brownian bridge diffusion process. With only 46.7M parameters (3× fewer than image diffusion methods and 20× fewer than video diffusion methods), TLB-VFI achieves approximately 20% FID improvement on SNU-FILM extreme and Xiph-4K benchmarks.
- Towards Robust Defense against Customization via Protective Perturbation Resistant to Diffusion-based Purification
-
This paper proposes AntiPure, an adversarial perturbation method that directly attacks the diffusion-based purification process through two guidance mechanisms—Patch-wise Frequency Guidance (PFG) and Erroneous Timestep Guidance (ETG)—to generate protective perturbations that continue to disrupt customization fine-tuning even after purification, outperforming all existing protection methods under the purification-customization (P-C) workflow.
- Trade-offs in Image Generation: How Do Different Dimensions Interact?
-
This paper proposes TRIG-Bench, a benchmark comprising 40,200 samples across 10 evaluation dimensions and 132 pairwise dimension subsets, along with a VLM-as-Judge metric termed TRIGScore. It is the first work to systematically reveal and analyze trade-offs among evaluation dimensions (e.g., realism, relation alignment, style) in image generation models, and leverages a Dimension Trade-off Map (DTM) to guide fine-tuning for performance improvement.
- Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting
-
This paper proposes Trans-Adapter, a plug-and-play adapter module that enables diffusion-based image inpainting models to directly process transparent (RGBA) images. It also introduces the LayerBench benchmark and the Alpha Edge Quality (AEQ) metric.
- Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models
-
This paper proposes TLoRA, which decomposes the fine-tuning of pretrained weights into a Transform adaptation and a Residual adaptation, parameterized respectively via Tensor Ring Matrix (TRM) and Tensor Ring (TR) decompositions. On SDXL, TLoRA achieves highly parameter-efficient fine-tuning with only 0.4M parameters while outperforming LoRA and other baselines.
- TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
-
This paper proposes TRCE, a two-stage concept erasure strategy—textual semantic erasure followed by denoising trajectory steering—that reliably removes malicious concepts while minimizing degradation of the model's general generation capability.
- Understanding Flatness in Generative Models: Its Role and Benefits
-
This paper presents the first systematic study of loss landscape flatness in generative models, particularly diffusion models. It theoretically demonstrates that flat minima enhance robustness to perturbations in the prior distribution, and empirically shows that SAM effectively promotes flatness in diffusion models, leading to improved generation quality, reduced exposure bias, and greater quantization robustness.
- UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer
-
UniCombine proposes a DiT-based multi-condition controllable generation framework that achieves unified generation under arbitrary condition combinations (text + spatial map + subject image) via a Conditional MMDiT Attention mechanism and a LoRA Switching module. It supports both training-free and training-based modes, and introduces SubjectSpatial200K, the first dataset for multi-condition generation.
- Unlocking the Potential of Diffusion Priors in Blind Face Restoration
-
This paper proposes FLIPNET, a unified framework built upon a T2I diffusion model that switches between a restoration mode (BoostHub selectively fuses LQ features + BFR-oriented facial embeddings) and a degradation mode (learns from real degradation datasets and synthesizes degraded images) by simply flipping the inputs, simultaneously addressing two key challenges: the HQ/LQ distribution gap and the synthetic/real degradation gap.
- Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching
-
DDM4IP proposes an unsupervised framework that models the degradation distribution via Conditional Flow Matching, while simultaneously learning an unknown forward degradation model through a distribution matching loss. Using only a small number of unpaired images, the method achieves competitive or superior performance on deblurring, spatially-varying PSF calibration, and blind super-resolution tasks.
- Video Color Grading via Look-Up Table Generation
-
This paper proposes a video color grading framework that explicitly generates Look-Up Tables (LUTs) via a diffusion model. A GS-Extractor captures high-level style features from a reference scene, and an L-Diffuser generates a color LUT that can be applied losslessly to all video frames in a single forward pass. Text prompts are further supported for fine-grained adjustments such as brightness and contrast.
- Video Motion Graphs
-
Video Motion Graphs proposes a retrieval-augmented generation system for human motion video synthesis. It constructs a motion graph from reference videos and performs conditioned path search to obtain keyframes, then employs HMInterp—a dual-branch diffusion-based frame interpolation model combining skeleton guidance from a Motion Diffusion Model and progressive condition training—to seamlessly connect discontinuous frames. The system supports multiple conditioning signals (music, speech, action labels) and significantly outperforms both generative and retrieval-based baselines in human motion video quality.
- VIGFace: Virtual Identity Generation for Privacy-Free Face Recognition Dataset
-
This paper proposes VIGFace, a framework that pre-allocates virtual prototypes orthogonal to real identities in the feature space of a face recognition (FR) model, and trains a diffusion model to generate face images conditioned on these prototypes—producing identities that do not exist in the real world, thereby enabling privacy-free face recognition dataset construction and data augmentation.
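One way to realize "prototypes orthogonal to real identities" is to project random directions off the span of the real ones; a sketch under that assumption (the paper's allocation scheme may differ):

```python
import torch
import torch.nn.functional as F

def virtual_prototypes(real_protos: torch.Tensor, num_virtual: int):
    """real_protos: (N, D) real identity embeddings with N < D. Returns
    (num_virtual, D) unit vectors orthogonal to every real prototype."""
    q, _ = torch.linalg.qr(real_protos.T)    # (D, N) orthonormal basis of real span
    v = torch.randn(num_virtual, real_protos.shape[1])
    v = v - (v @ q) @ q.T                    # project off the real-identity span
    return F.normalize(v, dim=-1)
```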
- VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
-
This paper presents VisualCloze, which unifies diverse image generation tasks under a "visual cloze" paradigm—defining tasks via visual in-context examples rather than text instructions, performing unified generation through an image infilling model, and constructing the Graph200K graph-structured dataset to enhance cross-task knowledge transfer. The framework supports in-distribution tasks, unseen-task generalization, multi-task composition, and reverse generation.
- What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?
-
Through systematic analysis of the behavior of \(W_{\{q,k,v,o\}}\) components during LoRA fine-tuning, this work reveals that \(W_v\) and \(W_o\) are responsible for learning panoramic spherical structure while \(W_q\) and \(W_k\) retain perspective-domain shared knowledge. Based on this finding, the paper proposes UniPano, an efficient single-branch panorama generation framework.
- What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization
-
This paper systematically analyzes the domain separation capacity of latent spaces from six pretrained models (CLIP, DiT, SD, MAE, DINOv2, ResNet) and demonstrates that diffusion model features are most effective at separating domain information in an unsupervised setting. Building on this insight, the authors propose GUIDE — a framework that leverages diffusion features to discover pseudo-domain representations and augment classifier features — achieving 66.3% average accuracy across five DomainBed datasets without domain labels (surpassing the ERM baseline by +2.6% and +4.3% on TerraIncognita), while outperforming most methods that require domain labels.
- Your Text Encoder Can Be An Object-Level Watermarking Controller
-
By fine-tuning only the pseudo-token embedding \(\mathcal{W}_*\) in the text encoder, this work achieves object-level invisible watermark embedding in T2I diffusion model-generated images, attaining 99% bit accuracy (48 bits) with \(10^5\times\) fewer parameters.