🎨 Image Generation

📷 CVPR2026 · 239 paper notes

2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching: This paper proposes 2ndMatch, a fine-tuning framework for pruned diffusion models that aligns the second-order Jacobian matrix \(J^\top J\) between the pruned and original models—inspired by finite-time Lyapunov exponents (FTLE)—to match their sensitivity to input perturbations over time, thereby significantly closing the generation quality gap.
Accelerating Diffusion Model Training under Minimal Budgets: A Condensation-Based Perspective: This paper proposes D2C (Diffusion Dataset Condensation)—the first dataset condensation framework for diffusion models—which achieves 100–233× training speedup while maintaining high-quality image generation by using only 0.8–8% of ImageNet data through a two-stage "Select + Attach" pipeline.
ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation: This paper proposes the ADAPT framework, which employs three training-free modules — Attention-driven adaptive Prompt Scheduling (APS), Pooling Embedding Manipulation (PEM), and Latent Space Manipulation (LSM) — to deterministically and semantically control the generation transition from common to rare concepts, significantly outperforming the R2F baseline on RareBench.
HINGE: Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images: HINGE is a framework that, for the first time, repurposes a pre-trained expression-space single-cell foundation model (sc-FM, CellFM) as a histology-image-conditioned spatial gene expression generator. It achieves state-of-the-art performance on three ST datasets while maintaining superior gene co-expression consistency, through three core mechanisms: identity-initialized SoftAdaLN for lightweight visual context injection, an expression-space masked diffusion process that aligns with the pre-training objective, and a warm-start curriculum to stabilize training.
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation: This paper proposes Adaptive Auxiliary Prompt Blending (AAPB), which derives a closed-form adaptive blending coefficient via the Tweedie formula to dynamically balance the contributions of an auxiliary anchor prompt and a target prompt at each denoising step. Without any training, AAPB significantly improves semantic accuracy and structural fidelity for both rare concept generation and zero-shot image editing.
Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration: This paper proposes Spectrum, a global spectral-domain feature forecasting method based on Chebyshev polynomials. By treating the intermediate features of the diffusion denoiser as functions of time and fitting coefficients via ridge regression, it achieves long-range feature forecasting in which errors do not grow with step size. It reaches 4.79× acceleration on FLUX.1 and 4.67× on Wan2.1-14B with near-lossless quality.
Agentic Retoucher for Text-To-Image Generation: The problem of correcting local distortions (deformed fingers, facial abnormalities, text errors, etc.) in T2I diffusion model outputs is modeled as a Perception-Reasoning-Action multi-agent cyclic system named Agentic Retoucher. It utilizes a Perception Agent to locate defects via context-aware distortion saliency maps, a Reasoning Agent to diagnose distortion types through structured reasoning, and an Action Agent to execute repairs via tool selection. Combined with the GenBlemish-27K dataset, it achieves end-to-end iterative automatic correction.
Agentic Retoucher for Text-To-Image Generation: Agentic Retoucher reframes the local defect restoration of T2I generated images into a multi-agent closed-loop decision process of Perception \(\to\) Reasoning \(\to\) Action. Through context-aware saliency detection, human-preference-aligned diagnostic reasoning, and adaptive tool selection, it achieves autonomous restoration, improving plausibility by 2.89 points on GenBlemish-27K, with 83.2% of results rated better than the original by humans.
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations: AHS overcomes the limitations of self-supervised training by using a head reenactment model (GAGAvatar) to generate synthetic augmented data. Combined with a dual-encoder attention mechanism and an adaptive masking strategy, it achieves SOTA results in head swapping tasks for full-body images.
AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution: AlignVAR addresses two consistency failures of visual autoregressive (VAR) models in image super-resolution (ISR): spatially incoherent reconstructions caused by locally biased attention, and cross-scale error accumulation induced by residual supervision. The proposed framework introduces Spatial Consistency Autoregression (SCA) and Hierarchical Consistency Constraint (HCC) to jointly resolve both issues, achieving reconstruction quality superior to diffusion-based methods while delivering over 10× faster inference.
All-in-One Slider for Attribute Manipulation in Diffusion Models: The proposed All-in-One Slider framework trains a lightweight Attribute Sparse Autoencoder on the intermediate layer embeddings of a text encoder. It decomposes attributes into disentangled directions within a high-dimensional sparse activation space, enabling continuous, fine-grained, and composable control of multiple facial attributes with a single module. It also demonstrates zero-shot continuous manipulation capabilities for unseen attributes (e.g., ethnicity, celebrities).
All-in-One Slider for Attribute Manipulation in Diffusion Models: The All-in-One Slider framework is proposed, which trains an Attribute Sparse Autoencoder on the text embedding space to decouple various facial attributes into sparse semantic directions. This enables a single lightweight module to achieve fine-grained continuous control over 52+ attributes, supporting multi-attribute combinations and zero-shot manipulation of unseen attributes.
Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling: The Ani3DHuman framework is proposed, combining kinematics-driven mesh animation with video diffusion priors. Through Self-guided Stochastic Sampling, it restores low-quality rigid body renderings into high-fidelity videos, achieving realistic modeling of non-rigid clothing dynamics.
APPLE: Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping: APPLE proposes a teacher-student framework based on diffusion models. It trains a teacher model using conditional deblurring (instead of traditional conditional inpainting) to generate attribute-aligned pseudo-labels, which are then used to train a student model. This achieves SOTA performance in attribute preservation (FID 2.18, Pose Error 1.85) while maintaining identity transfer capabilities.
Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation: Ar2Can decomposes multi-human image generation into two stages — spatial planning (Architect) and identity-preserving rendering (Artist) — and trains the Artist model via GRPO reinforcement learning with a spatially-anchored face reward based on Hungarian matching. The method achieves an identity preservation score of 68.2 and a counting accuracy of 90.2 on MultiHuman-Testbench, substantially outperforming all baselines.
AS-Bridge: A Bidirectional Generative Framework Bridging Next-Generation Astronomical Surveys: The authors propose AS-Bridge, a bidirectional generative framework based on the Brownian Bridge diffusion process. It models the probabilistic conditional distribution between ground-based LSST and space-based Euclid astronomical surveys, enabling cross-survey image translation and rare event detection (gravitational lensing), while improving likelihood estimation of the standard Brownian Bridge via an \(\epsilon\)-prediction training objective.
AS-Bridge: A Bidirectional Generative Framework Bridging Next-Generation Astronomical Surveys: AS-Bridge is proposed to model the conditional probability distribution between ground-based LSST and space-based Euclid survey observations using a bidirectional Brownian Bridge diffusion process, enabling cross-survey probabilistic image translation and unsupervised strong gravitational lens detection by leveraging reconstruction inconsistency.
Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models: This paper employs linear probing to demonstrate that implicit decisions in diffusion models—such as defaulting to male when gender is unspecified—are primarily governed by self-attention layers rather than cross-attention layers. Building on this finding, the paper proposes ICM, a method that intervenes on a small number of critical self-attention layers to achieve state-of-the-art debiasing while minimizing image quality degradation.
Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution: This paper reframes AI-generated image attribution from a classification paradigm to an instance retrieval paradigm, proposing the LIDA framework. It extracts generator-specific fingerprints from RGB low-bit planes as input, and achieves open-set attribution via unsupervised pre-training on real images followed by few-shot adaptation. Under the 1-shot setting, LIDA achieves average Rank-1 accuracies of 40.4% and 77.5% on GenImage and WildFake, respectively, substantially outperforming existing methods.
Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution: This paper proposes LIDA, which reformulates AI-generated image attribution from a classification problem into a retrieval problem. By leveraging low-bit-plane fingerprints to capture generator-specific artifacts, combined with unsupervised pre-training and few-shot adaptation, LIDA achieves state-of-the-art Deepfake detection and image attribution under zero-shot and few-shot settings.
AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models: Proposes AutoDebias—the first unified framework to simultaneously detect and mitigate malicious backdoor biases in T2I models. It leverages VLM open-set detection to discover trigger-bias associations and construct look-up tables, then eliminates backdoor associations through CLIP-guided distribution alignment training. It reduces the attack success rate from 90% to nearly 0 across 17 backdoor scenarios while maintaining image quality.
Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro: Banana100 systematically investigates quality degradation in multi-turn editing by having Nano Banana Pro iteratively replicate images 100 times, constructing a dataset of 28,000 degraded images. The study reveals a startling finding: 21 mainstream No-Reference Image Quality Assessment (NR-IQA) metrics fail to reliably detect iterative degradation—most metrics even assign higher scores to noisy images than to clean ones.
BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling: This paper proposes BeautyGRPO, a reinforcement learning-based face retouching framework that constructs a fine-grained preference dataset FRPref-10K to train a dedicated reward model, and introduces a Dynamic Path Guidance (DPG) mechanism to balance stochastic exploration and high fidelity, achieving natural retouching results aligned with human aesthetic preferences.
Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training: This paper identifies a "Motion-Vision Quality Dilemma" where motion quality (MQ) and visual quality (VQ) are negatively correlated in video data. Through gradient analysis, it reveals that imbalanced data can produce equivalent learning signals at appropriate timesteps. The proposed TQD framework enables training on imbalanced data to surpass training on "golden data."
BiGain: Unified Token Compression for Joint Generation and Classification: BiGain proposes a frequency-aware token compression framework. Through Laplacian-gated token merging (preserving high-frequency details) and interpolate-extrapolate KV downsampling (preserving query precision), it is the first to simultaneously optimize generation quality and classification accuracy in diffusion model inference acceleration.
BiGain: Unified Token Compression for Joint Generation and Classification: BiGain proposes a frequency-aware token compression framework comprising two training-free operators: Laplacian-Gated Token Merging and Interpolation-Extrapolation KV Downsampling. It is the first to maintain generation quality while significantly improving discriminative classification performance in diffusion model acceleration.
BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation: BiMotion is proposed to compress variable-length motion sequences into a fixed number of control points using continuously differentiable B-spline curves. Combined with a specialized VAE and a flow-matching diffusion model, it achieves fast, highly expressive, and semantically complete text-guided dynamic 3D character generation, outperforming existing methods in both quality and efficiency.
BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment: This paper proposes the BioVITA framework, comprising a million-scale tri-modal (image–text–audio) biological dataset, a two-stage alignment model, and a six-direction cross-modal species-level retrieval benchmark, achieving for the first time unified visual-textual-acoustic representation learning in the biological domain.
BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation: This paper proposes BlackMirror, a two-stage framework that achieves generalizable black-box backdoor detection against T2I models through fine-grained instruction-response semantic deviation detection (MirrorMatch) and cross-prompt stability verification (MirrorVerify). The framework achieves an average F1 of 89.46%, substantially outperforming the existing black-box method UFID.
CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing: Proposes CARE-Edit, a condition-aware expert routing framework that implements dynamic computation allocation on a DiT backbone via heterogeneous experts (Text/Mask/Reference/Base) coupled with a lightweight latent-attention router. This effectively addresses issues like color bleeding and identity drift caused by conflicting multi-conditional signals (text, mask, reference image) in unified image editors.
CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion: Proposes CaReFlow, the first work to utilize rectified flow for multimodal distribution mapping to bridge the modality gap: it enables source modality data points to observe the global distribution of the target modality through one-to-many mapping, applies different alignment intensities to modality pairs with varying correlation via adaptive relaxed alignment, and ensures no information loss after mapping through cyclic rectified flow. It achieves SOTA on multiple multimodal affective computing benchmarks even with simple concatenation fusion.
Causal Motion Diffusion Models for Autoregressive Motion Generation: This paper proposes CMDM, a framework that unifies diffusion denoising and autoregressive generation within a motion-language-aligned causal latent space. By employing frame-wise independent noise and a causal uncertainty-based sampling schedule, CMDM achieves high-quality, low-latency text-to-motion generation and long-sequence streaming synthesis.
Guiding Diffusion Models with Semantically Degraded Conditions (CDG): Condition-Degradation Guidance (CDG) replaces the null prompt \(\emptyset\) in CFG with a semantically degraded condition \(\boldsymbol{c}_{\text{deg}}\), transforming the guidance from a "good vs. empty" comparison to a refined "good vs. almost good" contrast. This significantly improves the compositional generation precision of diffusion models without requiring any training.
CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance: This paper reinterprets Classifier-Free Guidance (CFG) as a feedback control process within flow matching diffusion models, proposes a unified framework termed CFG-Ctrl, and introduces SMC-CFG — a nonlinear feedback guidance mechanism grounded in sliding mode control (SMC) — which substantially improves semantic consistency and generation robustness at large guidance scales.
ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing: Proposes ChangeBridge, which achieves conditional spatiotemporal image generation from pre-event to post-event in remote sensing scenes via a drift-asynchronous diffusion bridge. It supports multimodal controls including coordinate-text, semantic masks, and instance layouts, serving as a data generation engine for change detection tasks.
ChordEdit: One-Step Low-Energy Transport for Image Editing: Based on dynamic optimal transport theory, a low-energy Chord control field is derived to smooth unstable naive editing fields, achieving the first training-free, inversion-free, and high-fidelity real-time image editing for distilled one-step T2I models.
Cinematic Audio Source Separation Using Visual Cues: This paper proposes the first audio-visual cinematic audio source separation (AV-CASS) framework, which leverages visual cues from dual video streams (face and scene) to perform generative three-way audio separation (dialogue/effects/music) via conditional flow matching, training solely on synthetic data while generalizing to real films.
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers: Internal circuit mechanisms for spatial relation generation in Diffusion Transformers (DiT) are revealed through mechanistic interpretability: Randomized Token Embedding (RTE) models utilize a two-stage modular circuit (Relation Heads + Object Generation Heads), while T5-encoded models fuse relation information into object tokens for single-token decoding. The two mechanisms differ significantly in robustness.
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers: Through mechanistic interpretability methods, this work reveals two distinct circuit mechanisms for spatial relation generation in Diffusion Transformers: Randomized Text Encoders (RTE) use a two-stage modular circuit with "relation heads + object heads," while T5 encoders integrate relation information into object tokens for single-token decoding, making the latter more fragile under out-of-distribution perturbations.
CoD: A Diffusion Foundation Model for Image Compression: This paper proposes CoD, the first diffusion foundation model designed for image compression. Trained from scratch for joint compression-generation optimization, CoD replaces Stable Diffusion in downstream diffusion codecs and achieves state-of-the-art performance at ultra-low bitrates (0.0039 bpp), with a training cost of only 0.3% of that required by SD.
coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation: This paper proposes coDrawAgents, an interactive multi-agent dialogue framework (Interpreter-Planner-Checker-Painter). It significantly enhances the faithfulness of compositional text-to-image generation in complex scenarios through divide-and-conquer incremental layout planning, visual context-driven spatial reasoning, and an explicit error correction mechanism.
coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation: This paper proposes coDrawAgents, an interactive multi-agent dialogue framework in which four specialized agents — Interpreter, Planner, Checker, and Painter — collaborate in a closed loop. A divide-and-conquer strategy incrementally plans layouts group by group according to semantic priority, grounding reasoning in canvas visual context with explicit error correction. The framework achieves an Overall Score of 0.94 on GenEval, substantially outperforming GPT Image 1 (0.84), and reaches 85.17 SOTA on DPG-Bench.
CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment: This paper proposes CognitionCapturerPro, which integrates EEG signals with four modalities (image, text, depth, and edge) via Uncertainty-Weighted Masking (UM), a multi-modal fusion encoder, and Shared-Trunk Multi-Head Alignment (STH-Align). On THINGS-EEG, the method achieves a Top-1 retrieval accuracy of 61.2% and Top-5 of 90.8%, improving over the predecessor CognitionCapturer by 25.9% and 10.6%, respectively.
CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment: CognitionCapturerPro addresses Fidelity Loss via uncertainty-weighted masking and resolves Representational Shift by integrating image, text, depth, and edge information through a multi-modal fusion encoder. Combined with a lightweight shared backbone alignment replacing diffusion priors, it improves Top-1/Top-5 retrieval accuracy on the THINGS-EEG dataset by 25.9% and 10.6%, respectively.
CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation: Introduces CoLoGen, a unified image generation framework based on "Concept-Localization Duality." Through progressive staged training and the Progressive Representation Weaving (PRW) dynamic expert routing architecture, it simultaneously matches or exceeds specialized models across three major tasks: instruction editing, controllable generation, and personalized generation.
ConsistCompose: Unified Multimodal Layout Control for Image Composition: The paper proposes ConsistCompose, which achieves layout-controllable multi-instance image generation within a unified multimodal framework by embedding layout coordinates directly into language prompts (the LELG paradigm). It constructs the ConsistCompose3M dataset with 3.4 million samples providing layout and identity supervision. Coupled with a Coordinate-aware CFG mechanism, it achieves a 7.2% improvement in layout IoU and a 13.7% improvement in AP on COCO-Position while maintaining general understanding capabilities.
ConsistCompose: Unified Multimodal Layout Control for Image Composition: This paper proposes the LELG (Language-Embedded Layout Guidance) paradigm, which encodes bounding box coordinates directly into text tokens within the language stream. This achieves layout-controllable multi-instance image generation in a unified multimodal Transformer without requiring any specialized layout encoders or branches.
COT-FM: Cluster-wise Optimal Transport Flow Matching: This paper proposes COT-FM, a plug-and-play Flow Matching enhancement framework that clusters target samples, inverts a pretrained model to recover cluster-wise source distributions, and approximates optimal transport within each cluster. This significantly straightens transport trajectories, simultaneously accelerating sampling and improving generation quality without modifying the model architecture.
CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think: CRAFT proposes an ultra-lightweight alignment method for diffusion models: it automatically constructs high-quality training sets through a Compositional Reward Filtering (CRF) strategy and then performs an enhanced version of SFT. Theoretically, CRAFT optimizes the lower bound of Group Relative Policy Optimization (GRPO). Using only 100 samples, it outperforms SOTA methods that require thousands of preference pairs, while training 11–220× faster.
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video: This paper proposes C-MET (Cross-Modal Emotion Transfer), which models the mapping of emotion semantic vectors between speech and facial expression spaces, achieving for the first time speech-driven talking face video generation for extended emotions (e.g., sarcasm, charisma), surpassing the state of the art in emotion accuracy by 14%.
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration: This paper proposes CTCal (Cross-Timestep Self-Calibration), which leverages reliable text-image alignments (cross-attention maps) formed at small timesteps (low noise) to calibrate representation learning at large timesteps (high noise), providing explicit cross-timestep self-supervision for text-to-image generation. CTCal comprehensively outperforms existing methods on T2I-CompBench++ and GenEval.
Cycle-Consistent Tuning for Layered Image Decomposition: A cycle-consistent fine-tuning framework based on diffusion models is proposed to achieve image layer separation (e.g., logo-object decomposition) by jointly training decomposition and synthesis models. A progressive self-improving data augmentation strategy is introduced to achieve robust decomposition in scenarios with non-linear layer interactions.
D2C: Accelerating Diffusion Model Training under Minimal Budgets via Condensation: This work introduces dataset condensation to diffusion model training for the first time, proposing the D2C two-stage framework (Select+Attach). Using only 0.8% of ImageNet data, it achieves an FID of 4.3 in 40K steps, training 100× faster than REPA and 233× faster than vanilla SiT.
DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment: This paper proposes Detail-Aligned VAE (DA-VAE), which introduces structured latent representations (base + detail channels) with an alignment loss to achieve a 4× compression ratio increase over pretrained VAEs without retraining diffusion models from scratch, requiring only 5 H100-days to adapt SD3.5 for 1024×1024 image generation.
Elucidating the SNR-t Bias of Diffusion Probabilistic Models: This paper reveals the pervasive SNR-t bias in diffusion models (the mismatch between the Signal-to-Noise Ratio of samples in the reverse process and their timestamps) and proposes Differential Correction in Wavelet domain (DCW). DCW is a training-free, plug-and-play method that enhances the generation quality across various diffusion models.
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation: DeCo proposes a frequency-decoupled pixel diffusion framework that delegates high-frequency detail synthesis to a lightweight pixel decoder while allowing the DiT to focus on low-frequency semantic modeling. Combined with a frequency-aware flow matching loss, it achieves FID 1.62 (256×256) and 2.22 (512×512) on ImageNet, substantially narrowing the gap between pixel diffusion and latent diffusion models.
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache: Diffusion model sampling acceleration is formulated as a global path planning problem. By constructing a Path-Aware Cost Tensor (PACT) to quantify the path dependency of skipping errors and using dynamic programming to select the optimal sequence of key steps, DPCache achieves a 4.87× speedup on FLUX while surpassing the full-step baseline by 0.028 in ImageReward.
Depth Adaptive Efficient Visual Autoregressive Modeling: Reveals the fundamental limitations of the frequency-driven hard pruning paradigm in VAR models and proposes DepthVAR, a training-free inference acceleration framework. By adaptively allocating the Transformer computation depth for each token (rather than binary keep/prune), it achieves \(2.3\times\)-\(3.1\times\) speedup with minimal quality loss.
Diffusion Mental Averages: This paper proposes Diffusion Mental Averages (DMA), which extracts "mental average" prototype images of concepts from pretrained diffusion models by aligning multiple denoising trajectories in semantic space, achieving consistent and realistic visualization of concept averages for the first time.
Diffusion Probe: Generated Image Result Prediction Using CNN Probes: This work discovers that the cross-attention distribution in early denoising steps of diffusion models is highly correlated with final image quality. It proposes Diffusion Probe — a lightweight CNN that predicts generation quality from early attention maps — enabling pre-filtering of low-quality generation trajectories after only 10% of denoising steps, thereby accelerating prompt optimization, seed selection, and GRPO training.
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization: This paper proposes DiFlowDubber, an automated video dubbing framework based on Discrete Flow Matching (DFM). Through a two-stage training pipeline (Zero-shot TTS pre-training → Video dubbing adaptation), large-scale TTS knowledge is transferred to video-driven dubbing. The framework features a FaPro module to capture the mapping between facial expression and prosody, and a Synchronizer module for precise lip-sync.
DiP: Taming Diffusion Models in Pixel Space: The paper proposes DiP, an efficient pixel-space diffusion framework. By utilizing a DiT backbone to model global structures on large patches combined with a lightweight Patch Detailer Head to recover local details, it achieves computational efficiency comparable to LDMs without requiring a VAE, reaching a 1.79 FID on ImageNet 256×256.
Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation: This paper proposes the DisCo framework, which resolves the similarity-controllability paradox in subject-driven image generation by first decoupling textual and visual information (replacing entity words with pronouns to eliminate textual interference on the subject) and then re-coupling them via GRPO with a dedicated reward model.
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression: This paper proposes DiT-IC, which adapts a pre-trained T2I Diffusion Transformer into a one-step image compression reconstruction model via three alignment mechanisms (Variance-Guided Reconstruction Flow, Self-Distillation Alignment, and Latent Conditional Guidance). By performing diffusion in a deep latent space with \(32\times\) downsampling, it achieves SOTA perceptual quality with decoding speeds \(30\times\) faster than existing diffusion-based codecs.
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression: This work adapts a pretrained text-to-image DiT (SANA) into an efficient single-step image compression decoder. Three alignment mechanisms are proposed: variance-guided reconstruction flow (pixel-level adaptive denoising intensity), self-distillation alignment (encoder latents as distillation targets), and latent-conditioned guidance (replacing the text encoder). Operating entirely in a deep latent space with 32× downsampling, the method achieves state-of-the-art perceptual quality (BD-rate DISTS −87.88%), decodes 30× faster than prior diffusion-based methods, and can reconstruct 2K images on a 16 GB laptop GPU.
Diversity over Uniformity: Rethinking Representation in Generated Image Detection: This paper proposes an Anti-Feature-Collapse Learning (AFCL) framework that filters task-irrelevant features via an information bottleneck and suppresses excessive overlap among heterogeneous forgery cues, thereby preserving diversity and complementarity in discriminative representations. The method achieves significant improvements in cross-model generated image detection.
DMin: Scalable Training Data Influence Estimation for Diffusion Models: Proposes DMin, a scalable training data influence estimation framework for diffusion models. By using an efficient gradient compression pipeline, it reduces storage requirements from hundreds of terabytes down to MB/KB levels, enabling influence estimation for billion-parameter diffusion models for the first time and supporting sub-second top-k retrieval.
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache: This paper formalizes diffusion model sampling acceleration as a global path planning problem. By constructing a Path-Aware Cost Tensor (PACT) and applying dynamic programming to select the optimal sequence of key timesteps, the method achieves training-free 4.87× acceleration while surpassing the full-step baseline in generation quality.
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning: This paper proposes DreamVideo-Omni, a two-stage progressive training paradigm (omni-motion identity supervised fine-tuning followed by latent identity reward feedback learning) that is the first to unify multi-subject customization with full-granularity motion control (global bounding boxes + local trajectories + camera motion) within a single DiT architecture.
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning: This paper proposes DreamVideo-Omni, a unified DiT framework for multi-subject identity customization and omni-motion control (global bbox + local trajectory + camera motion). It resolves multi-subject ambiguity via condition-aware 3D RoPE and Group/Role Embeddings, and introduces Latent Identity Reward Feedback Learning (LIReFL) to provide dense identity rewards at arbitrary denoising timesteps, enabling efficient identity reinforcement by bypassing the VAE decoder.
DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution: The paper proposes DUO-VSR, a three-stage distillation framework that compresses multi-step video super-resolution models into a one-step generator through progressive guided distillation initialization, dual-stream distillation (joint optimization of DMD and RFS-GAN), and preference-guided refinement. It achieves approximately 50× acceleration while surpassing the visual quality of previous one-step VSR methods.
DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data: DynaVid proposes utilizing synthetic optical flow rendered via computer graphics (rather than synthetic videos) to train video diffusion models. Through a two-stage framework consisting of a motion generator and a motion-guided video generator, it achieves realistic video synthesis of highly dynamic motions and fine-grained camera control.
EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation: EdgeDiT proposes a hardware-aware optimization framework for Diffusion Transformers that trains lightweight proxy blocks via hierarchical knowledge distillation and searches for Pareto-optimal architectures through multi-objective Bayesian optimization, achieving 20–30% parameter reduction, 36–46% FLOPs reduction, and 1.65× on-device speedup while maintaining or surpassing the generation quality of the original DiT-XL/2.
Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking: This paper provides a unified theoretical and experimental analysis of how non-adversarial diffusion editing inadvertently destroys robust invisible watermarks, deriving bounds for watermark SNR and mutual information decay, and validating systemic failures of watermark recovery across scenarios such as instruction-based editing, drag-based editing, and training-free synthesis.
Editing Away the Evidence: Diffusion-Based Image Manipulation and the Failure Modes of Robust Watermarking: This paper systematically analyzes, from both theoretical (SNR attenuation, mutual information lower bounds, denoising contraction) and empirical perspectives, how non-adversarial diffusion editing (instruction-based, drag-based, and composition-based) inadvertently destroys robust invisible watermarks, revealing that traditional post-processing robustness does not generalize to generative transformations.
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing: This paper proposes EffectErase, a framework that jointly learns video object insertion as an inverse auxiliary task to object removal, and constructs a large-scale VOR dataset containing 60K video pairs, enabling high-quality erasure of objects along with their associated visual side effects, including occlusion, shadow, reflection, illumination changes, and deformation.
EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation: EgoFlow proposes a generative framework based on Flow Matching that integrates multimodal scene conditions through a Mamba-Transformer-Perceiver hybrid architecture. During inference, it applies differentiable physical constraints (collision avoidance, motion smoothness) via gradient-guided sampling to generate physically plausible 6DoF object motion trajectories from first-person videos, reducing collision rates by up to 79%.
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation: This work is the first to extend the MeanFlow framework from class-label conditioning to text-conditioned image generation. It identifies the semantic discriminability and disentanglement of text representations as the key bottlenecks under limited inference steps, and achieves high-quality few-step/one-step T2I generation based on the BLIP3o-NEXT text encoder.
EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories: The EMMA benchmark is proposed to systematically evaluate concept erasure methods for T2I models across five dimensions (erasing ability, retaining ability, efficiency, quality, and bias) with 12 metrics. Covering 206 concept categories across 5 domains, it reveals for the first time the shallow erasure nature and bias amplification issues of existing methods under implicit prompts.
Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception: This paper proposes the DIAE framework, which transforms vague aesthetic instructions into joint signals of HSV/contour maps and text via Multimodal Aesthetic Perception (MAP). It leverages an "imperfectly paired" dataset, IIAEData, to achieve weakly supervised image aesthetic enhancement.
Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception: DIAE proposes a Multimodal Aesthetic Perception (MAP) module to convert vague aesthetic instructions into explicit control signals (HSV + contour maps + text). It constructs an "imperfectly paired" dataset, IIAEData, and utilizes a dual-branch supervision framework for weakly supervised training, achieving content-consistent aesthetic enhancement with a 17.4% improvement in LAION aesthetic scores.
Enhancing Spatial Understanding in Image Generation via Reward Modeling: The authors construct the SpatialReward-Dataset, an 80K adversarial preference dataset, to train SpatialScore—a reward model specifically for evaluating spatial relationship accuracy (outperforming GPT-5). By integrating a top-k filtering strategy with GRPO online RL, they significantly enhance the spatial generation capabilities of FLUX.1-dev.
Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models: This paper systematically evaluates the trade-off between safety (erasure success rate) and compositional generation capability across 16 text-to-image diffusion model unlearning methods. It reveals that aggressive erasure strategies, while removing undesirable content, severely damage attribute binding, spatial reasoning, and counting abilities, emphasizing that safety interventions should not come at the expense of the model's semantic logic.
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation: This paper proposes EVATok, a four-stage framework that first uses a proxy tokenizer to estimate the optimal token allocation for each video, then trains a lightweight router to predict these allocations in a single forward pass, and finally trains an adaptive tokenizer that flexibly assigns token counts according to content complexity. On UCF-101, EVATok achieves state-of-the-art generation quality with a 24.4% reduction in token count.
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation: This paper proposes EVATok, a four-stage framework that defines optimal token allocation via a proxy reward, trains a lightweight router to predict the optimal token budget for each video segment, and achieves content-adaptive variable-length video tokenization. EVATok attains state-of-the-art generation quality on UCF-101 while saving at least 24.4% of tokens.
Exploring Conditions for Diffusion Models in Robotic Control: This paper investigates how to leverage the conditioning mechanisms of pretrained text-to-image diffusion models to generate task-adaptive visual representations for robotic control. It identifies that text conditioning fails in control environments due to domain gap, and proposes ORCA, a framework that employs learnable task prompts and per-frame visual prompts as conditioning signals. ORCA achieves state-of-the-art performance across 12 tasks on three benchmarks: DMC, MetaWorld, and Adroit.
ExpPortrait: Expressive Portrait Generation via Personalized Representation: This paper proposes a high-fidelity personalized head representation (static identity offset + dynamic expression offset) to address the limited expressiveness of parametric models such as SMPL-X. Combined with an identity-adaptive expression transfer module and a DiT-based generator, the method achieves state-of-the-art performance on both self-driven portrait video animation and cross-identity reenactment tasks.
ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop: This paper presents ExpressEdit, a fully open-source Photoshop plugin that achieves noise-free editing of stylized facial expressions within 3 seconds on a single consumer-grade GPU, leveraging a SPICE-based diffusion model backend combined with a Danbooru expression tag database and a RAG system, significantly outperforming commercial models such as GPT, Grok, and Nano Banana 2.
Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration: Face2Scene proposes a two-stage framework: a reference-based face restoration model (Ref-FR) first produces HQ-LQ face pairs from which degradation codes are extracted as an "oracle"; these codes then condition a single-step diffusion model to restore the full scene, including body and background.
FDeID-Toolbox: Face De-Identification Toolbox: This paper presents FDeID-Toolbox, a modular face de-identification toolbox that uniformly integrates 16 de-identification methods (spanning four categories: naive, generative, adversarial, and K-Same), 6 benchmark datasets, and a systematic evaluation protocol covering three dimensions—privacy protection, attribute preservation, and visual quality—addressing the field's persistent problems of fragmented implementations, inconsistent evaluation protocols, and incomparable results.
FDeID-Toolbox: Face De-Identification Toolbox: This paper proposes FDeID-Toolbox, a modular face de-identification research toolbox comprising four standardized components—data loading, unified method implementation, flexible inference pipeline, and systematic evaluation protocol—enabling, for the first time, fair and reproducible comparisons across diverse de-identification methods along three dimensions: privacy protection, utility preservation, and visual quality.
Few-shot Acoustic Synthesis with Multimodal Flow Matching: This paper proposes FLAC, the first flow matching-based few-shot room impulse response (RIR) generation framework, capable of synthesizing spatially consistent acoustic responses in unseen scenes from a single recording. It further introduces AGREE, a joint embedding for geometry–acoustic consistency evaluation.
FG-Portrait: 3D Flow Guided Editable Portrait Animation: FG-Portrait introduces "3D optical flow" — directly computed from the FLAME parametric 3D head model without any learning — as a geometry-driven motion correspondence signal. Combined with depth-guided sampling for 3D flow encoding as the motion condition for a diffusion model ControlNet, the method achieves substantially improved motion transfer accuracy (APD reduced by 22%+) and supports inference-time expression and head pose editing.
Flash-Unified: Training-Free and Task-Aware Acceleration for Native Unified Models: FlashU conducts the first systematic redundancy analysis of native unified multimodal models, identifying parameter specialization and computational heterogeneity. Based on these findings, it proposes a training-free, task-aware acceleration framework that achieves 1.78×–2.01× speedup on Show-o2 while maintaining SOTA performance, through FFN pruning, dynamic layer skipping, adaptive guidance scaling, and diffusion head caching.
FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation: FontCrafter reframes artistic font generation as a visual in-context generation task. By horizontally concatenating reference element images with a blank canvas and feeding the result into a pretrained inpainting model (FLUX.1-Fill), it achieves high-fidelity element-driven font creation, significantly outperforming existing methods in both texture and structural fidelity.
Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems: This paper proves that the DDIM deterministic reverse chain is equivalent to a Partitioned Iterated Function System (PIFS), and derives three computable quantities from fractal geometry—the contraction threshold \(L_t^*\), the diagonal expansion function \(f_t(\lambda)\), and the global expansion threshold \(\lambda^{**}\)—providing a unified theoretical explanation for four empirically motivated design choices: cosine schedule offset, resolution logSNR shift, Min-SNR loss weighting, and Align Your Steps sampling schedule.
Fractals made Practical: Denoising Diffusion as Partitioned Iterated Function Systems: This paper proves that the DDIM deterministic reverse chain is essentially a Partitioned Iterated Function System (PIFS), and derives from this framework three computable geometric quantities requiring no model evaluation. It provides a unified, first-principles explanation for the two-phase denoising dynamics of diffusion models, the effectiveness of self-attention, and four empirical design choices (cosine schedule offset, resolution-dependent logSNR shift, Min-SNR loss weighting, and Align Your Steps sampling).
FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution: FRAMER proposes a frequency-aligned self-distillation training framework that uses final-layer feature maps as teacher supervision for intermediate layers. By applying IntraCL and InterCL contrastive losses to low-frequency (LF) and high-frequency (HF) components respectively, along with Frequency-based Adaptive Weight (FAW) and Frequency-based Adaptive Modulation (FAM), FRAMER significantly improves high-frequency detail recovery in diffusion-based real-world image super-resolution without modifying the network architecture or inference pipeline.
Frequency-Aware Flow Matching for High-Quality Image Generation: FreqFlow explicitly incorporates frequency-domain awareness into the flow matching framework via a dual-branch architecture that processes low-frequency global structures and high-frequency detail information separately, achieving state-of-the-art FID of 1.38 on ImageNet-256.
From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition: This paper identifies an intrinsic connection between image layer decomposition and inpainting/outpainting, and proposes the Outpaint-and-Remove framework, which efficiently adapts a pretrained inpainting DiT model (FLUX.1-Fill-dev) for layer decomposition via lightweight LoRA fine-tuning. A multi-modal context fusion module is introduced to preserve fine details. The method achieves state-of-the-art performance using only 100K synthetic training samples.
Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories: This paper introduces Garments2Look, the first large-scale multimodal outfit-level virtual try-on dataset (80K pairs, 40 categories, 300+ subcategories). Each sample contains 3–12 reference garment images, a model outfit image, and detailed textual annotations. The dataset exposes significant shortcomings of existing methods in multi-layer outfit composition and accessory consistency.
Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication: This paper models the watermark embedding and extraction process in diffusion models as communication over a noisy channel, and proposes the Gaussian Shannon framework. By cascading majority voting and LDPC error-correcting codes, the framework achieves bit-exact watermark recovery (rather than mere threshold-based detection), attaining state-of-the-art bit accuracy and detection rates across three Stable Diffusion versions and seven types of perturbation.
GIST: Towards Design Compositing: This paper proposes GIST, a training-free identity-preserving image compositing method that achieves style harmonization across multi-source visual elements via cross-attention-guided token injection and Flow Matched latent space initialization, serving as a plug-and-play compositing stage between layout prediction and typography generation.
gQIR: Generative Quanta Image Reconstruction: This work adapts a large-scale text-to-image latent diffusion model to the extreme photon-limited imaging regime of single-photon avalanche diodes (SPADs) via a three-stage framework—Quanta-aligned VAE → adversarially fine-tuned LoRA U-Net → FusionViT spatiotemporal fusion—enabling high-quality RGB image reconstruction from sparse binary photon detections and significantly outperforming all existing methods under extreme conditions of 10K–100K fps.
gQIR: Generative Quanta Image Reconstruction: This paper proposes gQIR, a modular three-stage framework that adapts large-scale text-to-image (T2I) diffusion models to the extreme photon-limited domain of SPAD sensors. It employs a quanta-aligned VAE (with a frozen encoder copy to prevent collapse), an adversarially fine-tuned LoRA U-Net for single-step generation, and a latent-space FusionViT for spatiotemporal fusion, enabling high-quality color image and video reconstruction from extremely sparse binary photon events.
GrOCE: Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models: GrOCE proposes a training-free concept erasure framework based on dynamic semantic graphs, achieving precise, context-aware online removal of target concepts in text-to-image diffusion models through three cooperative components: semantic graph construction, adaptive clustering identification, and selective severing.
Group Editing: Edit Multiple Images in One Go: This paper proposes GroupEditing, which reconstructs a group of related images as pseudo-video frames and combines explicit geometric correspondences from VGGT with the implicit temporal prior of a video diffusion model. Two specially designed positional encodings—Ge-RoPE and Identity-RoPE—are introduced to inject correspondence information, enabling cross-view consistent group image editing that significantly outperforms existing methods in visual quality, editing consistency, and semantic alignment.
Guiding a Diffusion Model by Swapping Its Tokens: This paper proposes Self-Swap Guidance (SSG), a training-free sampling guidance method for diffusion models that constructs perturbations by selectively swapping the most semantically dissimilar token pairs in the intermediate representation space. Compared to SAG/PAG/SEG, SSG stably generates high-fidelity images over a wider range of guidance scales, achieving state-of-the-art FID on both conditional and unconditional generation.
Guiding a Diffusion Transformer with the Internal Dynamics of Itself: This paper proposes Internal Guidance (IG), which adds auxiliary supervision losses to intermediate layers of a Diffusion Transformer to produce weaker generative outputs, then extrapolates the discrepancy between intermediate-layer and final-layer outputs at sampling time to achieve an Autoguidance-like effect — requiring no additional sampling steps or external model training. On ImageNet 256×256, IG pushes LightningDiT-XL/1 to FID 1.34 (without CFG) and 1.19 (+CFG), achieving state-of-the-art results among contemporaneous methods.
Guiding Diffusion Models with Semantically Degraded Conditions: This paper proposes Condition-Degradation Guidance (CDG), which replaces the null prompt \(\emptyset\) in CFG with a semantically degraded condition \(\boldsymbol{c}_{\text{deg}}\), transforming the guidance paradigm from a coarse-grained "good vs. empty" contrast to a fine-grained "good vs. slightly worse" contrast. Through a stratified degradation strategy—first degrading content tokens, then context-aggregating tokens—CDG constructs adaptive negative samples and achieves plug-and-play improvements in compositional generation accuracy on models including SD3, FLUX, and Qwen-Image, with negligible additional overhead.
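The change of reference condition described in the CDG entry can be sketched as follows. This is a minimal illustration assuming the standard classifier-free guidance extrapolation; the function name `guide` and the toy prediction values are hypothetical, and how the degraded condition is actually built (the stratified token degradation) is not modeled here.

```python
def guide(pred_cond, pred_ref, scale):
    """Extrapolate from a reference prediction toward the conditional one.

    Standard CFG passes the null-prompt prediction as `pred_ref`; the CDG
    idea, as summarized above, would pass the prediction obtained under a
    semantically degraded condition instead.
    """
    return [r + scale * (c - r) for c, r in zip(pred_cond, pred_ref)]


# Toy per-element model outputs (made-up numbers, for illustration only).
pred_full = [1.0, 2.0]      # prediction under the full prompt
pred_null = [0.0, 0.0]      # prediction under the null prompt (CFG reference)
pred_degraded = [0.5, 1.5]  # prediction under a degraded prompt (CDG-style reference)

cfg_output = guide(pred_full, pred_null, 3.0)
cdg_output = guide(pred_full, pred_degraded, 3.0)
print(cfg_output, cdg_output)
```

Because the degraded reference sits closer to the full-prompt prediction, the same guidance scale produces a smaller, more targeted correction in this toy setting.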
HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation: This paper proposes HaltNav, a hierarchical navigation framework that combines lightweight text-based topological maps (osmAG) for global planning with a VLN model for local execution. A Reactive Visual Halting (RVH) mechanism is introduced to interrupt execution upon encountering unknown obstacles, update the topology, and trigger replanning for detour. The framework achieves significant improvements over baselines in both simulation and real-robot experiments.
HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation: This paper proposes HaltNav, a hierarchical navigation framework that combines lightweight textual topological priors (osmAG) for global planning with a VLN model for local execution. A Reactive Visual Halting (RVH) mechanism monitors egocentric observations to detect unexpected obstacles, dynamically updates the topology, and triggers replanning. The approach substantially improves long-range navigation robustness in both simulation and real-robot settings.
HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models: This paper proposes HAM, a training-free style transfer method that achieves high-quality stylization without sacrificing content identity. HAM applies heterogeneous modulation (GAR + LAT) to self-attention and cross-attention layers of diffusion models, complemented by style-injected noise initialization (SINI), attaining state-of-the-art performance across multiple metrics.
HazeMatching: Dehazing Light Microscopy Images with Guided Conditional Flow Matching: This paper proposes HazeMatching, a guided conditional flow matching (Guided CFM) framework for microscopy image dehazing. By incorporating degraded observations as conditioning signals in the velocity field, the method achieves high data fidelity and high perceptual quality simultaneously without requiring an explicit degradation operator, while also providing well-calibrated uncertainty estimates.
Heterogeneous Decentralized Diffusion Models: This paper proposes a heterogeneous decentralized diffusion framework that allows different experts to train completely independently using distinct diffusion objectives (DDPM ε-prediction and Flow Matching velocity-prediction). At inference time, a deterministic schedule-aware conversion unifies all expert outputs into velocity space for fusion. Compared to homogeneous baselines, the framework simultaneously improves FID and generation diversity while reducing computation by 16×.
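To illustrate what converting an ε-prediction into velocity space can look like, here is a minimal sketch under stated assumptions: the DDPM expert uses the forward process \(x_t = \alpha_t x_0 + \sigma_t \epsilon\), and the target velocity follows the linear-path convention \(v = \epsilon - x_0\). These conventions, the function name `eps_to_velocity`, and the toy numbers are assumptions made for illustration; the paper's actual schedule-aware conversion may differ, for example in how timesteps are mapped between schedules.

```python
def eps_to_velocity(x_t, eps_pred, alpha_t, sigma_t):
    """Convert an epsilon-prediction into a velocity estimate.

    Assumes x_t = alpha_t * x0 + sigma_t * eps (DDPM-style forward process)
    and the velocity convention v = eps - x0 (linear interpolation path).
    """
    if alpha_t <= 0:
        raise ValueError("alpha_t must be positive to recover x0")
    # Invert the forward process to get the implied clean-sample estimate.
    x0_hat = [(x - sigma_t * e) / alpha_t for x, e in zip(x_t, eps_pred)]
    # Velocity under the assumed linear-path convention.
    return [e - x0 for e, x0 in zip(eps_pred, x0_hat)]


# Toy check: build x_t from a known x0 and eps, then convert a perfect eps-prediction.
x0 = [2.0, -1.0]
eps = [0.5, 1.5]
alpha_t, sigma_t = 0.8, 0.6
x_t = [alpha_t * a + sigma_t * b for a, b in zip(x0, eps)]
velocity = eps_to_velocity(x_t, eps, alpha_t, sigma_t)
print(velocity)
```

Once every expert's output is expressed as a velocity in this way, the outputs live in a common space and can be combined by a weighted sum.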
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images: This paper proposes HiFi-Inpaint, a framework that leverages high-frequency information to enhance product detail features via Shared Enhancement Attention (SEA), combined with a Detail-Aware Loss (DAL) for pixel-level high-frequency supervision, achieving state-of-the-art detail fidelity in human-product image generation.
High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning: This paper proposes an identity-constrained attribute tuning framework for diffusion-based face swapping: the method first constrains the identity solution space, then injects attribute conditions, and finally performs end-to-end refinement with identity and adversarial losses. Combined with a decoupled condition injection design, it achieves state-of-the-art FID (3.61) and identity retrieval accuracy (97.9% Top-1) on FFHQ.
Image Diffusion Preview with Consistency Solver: This paper proposes the Diffusion Preview paradigm and ConsistencySolver—a lightweight high-order ODE solver trained via reinforcement learning—that generates high-quality preview images with few-step sampling while ensuring consistency with full-step outputs. It achieves FID comparable to Multistep DPM-Solver using 47% fewer steps, reducing user interaction time by nearly 50%.
Image Generation as a Visual Planner for Robotic Manipulation: This work adapts a pretrained image generation model (DiT) via LoRA fine-tuning into a visual planner for robotic manipulation, generating temporally coherent action sequences in the form of \(3\times3\) grid images, supporting both text-conditioned and trajectory-conditioned control modes.
Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval: This paper proposes DreamPRVR, which adopts a coarse-to-fine "imagine before concentrate" strategy: a truncated diffusion model generates global semantic register tokens under text supervision, which are then fused into fine-grained video representations to suppress spurious local noise responses, achieving state-of-the-art performance on three PRVR benchmarks.
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards: This paper proposes SOLACE, a post-training framework that leverages the denoising self-confidence of text-to-image generation models as an intrinsic reward signal, requiring no external reward models while achieving consistent improvements in compositional generation, text rendering, and text-image alignment. SOLACE is also complementary to external rewards and mitigates reward hacking when combined with them.
InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation: This paper proposes InnoAds-Composer, a single-stage e-commerce poster generation framework built on MM-DiT. It maps three types of conditions — product subject, glyph text, and background style — into a unified token space via unified tokenization, and combines a Text Feature Enhancement Module (TFEM) with an importance-aware condition injection strategy to maintain high-quality generation while significantly reducing inference cost.
InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing: This paper proposes InterEdit, the first text-guided multi-human 3D motion editing framework. Through two alignment mechanisms—Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment—InterEdit achieves precise editing of two-person interactive motions within a conditional diffusion model, while preserving source motion consistency and interaction coherence.
InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing: This paper is the first to formally define the Text-guided Multi-human Motion Editing (TMME) task. It constructs the InterEdit3D dataset containing 5,161 source–target–instruction triplets and proposes the InterEdit conditional diffusion model. The model captures high-level editing intent via semantic-aware planning token alignment and models periodic interaction dynamics via interaction-aware frequency-domain token alignment, achieving state-of-the-art performance on instruction following (g2t R@1 30.82%) and source preservation (g2s R@1 17.08%), outperforming all four baselines across the board.
Interpretable and Steerable Concept Bottleneck Sparse Autoencoders: This paper identifies that the majority of SAE neurons (~81%) suffer from insufficient interpretability or steerability, and proposes the CB-SAE framework — which prunes low-utility SAE neurons and augments them with a concept bottleneck module — achieving +32.1% interpretability and +14.5% steerability improvements on LVLM and image generation tasks, respectively.
Intra-finger Variability of Diffusion-based Latent Fingerprint Generation: This paper systematically evaluates the intra-finger variability of diffusion-model-based fingerprint synthesis. By constructing a latent fingerprint style library spanning 40 surface types and 15 development techniques, it enhances generation diversity and quantifies both local and global identity inconsistencies introduced during the generation process.
Intrinsic Concept Extraction Based on Compositional Interpretability: HyperExpress introduces a novel task termed Compositional Interpretability-based Intrinsic Concept Extraction (CI-ICE). By leveraging the hierarchical modeling capacity of hyperbolic space and a horospherical projection module, it extracts composable object-level and attribute-level concepts from a single image, enabling invertible decomposition of complex visual concepts.
Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers: This paper proposes the Just-in-Time (JiT) framework, which dynamically selects sparse anchor tokens in the spatial domain to drive generative ODE evolution, and introduces a deterministic micro-flow (DMF) mechanism to ensure seamless activation of newly included tokens. JiT achieves up to 7× acceleration on FLUX.1-dev with negligible quality degradation.
Language-Free Generative Editing from One Visual Example: This paper reveals a critical text-visual alignment failure in text-guided diffusion models on simple visual transformations such as rain, haze, and blur, and proposes the VDC framework — which learns a purely visual conditioning signal from a single visual example pair (before and after transformation) to guide diffusion-based editing, requiring neither text nor training. VDC surpasses text-based and fine-tuning-based methods on tasks including deraining, dehazing, and denoising.
Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection: This paper identifies that real images exhibit stable layer-wise transitions in intermediate feature representations within a frozen CLIP ViT, whereas synthetic images exhibit abrupt attention shifts at intermediate layers. Based on this observation, the paper proposes Layer Transition Discrepancy (LTD) to model this difference, achieving mean Acc of 96.90% on UFD, 99.54% on DRCT-2M, and 91.62% on GenImage, surpassing all prior state-of-the-art methods.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories: This paper proposes LeapAlign, which constructs two-step leap trajectories to compress long generation paths into two steps, enabling reward gradients to be directly backpropagated to early generation steps. Combined with trajectory similarity weighting and gradient discounting strategies, LeapAlign achieves efficient post-training alignment of flow matching models.
Learnability-Guided Diffusion for Dataset Distillation: This paper proposes LGD, a learnability-driven incremental dataset distillation framework that constructs the distilled dataset in stages, conditioning each stage on the current model state to generate complementary rather than redundant training samples. By injecting learnability-score gradients into diffusion sampling, LGD reduces the 80–90% inter-sample information redundancy observed in existing methods by 39.1%, achieving 60.1% accuracy at 50 IPC on ImageNet-1K and 87.2% at 100 IPC on ImageNette.
Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition: Flora achieves robust skeleton–semantic cross-modal alignment via neighbor-aware semantic calibration, and constructs a distribution-aware open-form classifier using noise-free flow matching, attaining state-of-the-art performance on zero-shot skeleton action recognition—particularly under low-data training regimes.
Learning Latent Proxies for Controllable Single-Image Relighting: This paper proposes LightCtrl, a diffusion-based single-image relighting framework that achieves precise and continuous control over lighting direction, intensity, and color temperature. It introduces a few-shot latent proxy encoder to provide lightweight material–geometry priors, a lighting-aware mask to guide spatially selective denoising, and DPO post-training to enhance physical consistency. The method outperforms existing approaches on both synthetic and real-world benchmarks.
Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal: This paper proposes the VeilGen + DeVeiler framework, which employs a physics-guided Stable Diffusion generative model to learn latent transmission and glare maps for synthesizing realistic compound-degradation training data. A restoration network trained under invertible constraints jointly removes aberrations and veiling glare in simplified optical systems.
Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models: This paper proposes GvU, a self-supervised RL framework (based on GRPO) that leverages the visual understanding branch of a unified multimodal model (UMM) as an intrinsic reward signal. Token-level text-image alignment probabilities are used to iteratively improve T2I generation quality without any external supervision, achieving a 43.3% improvement on GenEval++. Notably, the enhanced generation in turn promotes fine-grained visual understanding.
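Several notes in this list build on GRPO. As a rough illustration only, the group-relative advantage that GRPO-style methods typically use can be sketched as below. The reward values are made up, the function name is hypothetical, and the use of the population standard deviation is an assumption; none of this is taken from the GvU paper itself.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in the same group.

    Samples that score above the group mean get a positive advantage,
    samples below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four images generated from the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the advantage is computed relative to the group, no separate value network is needed, which is one reason the approach pairs naturally with intrinsic reward signals.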
LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration: This paper proposes LESA, a framework that employs KAN (Kolmogorov-Arnold Network) as learnable temporal predictors, combined with a multi-stage multi-expert architecture and a two-phase training strategy. LESA achieves 5× acceleration on FLUX with only 1.0% quality degradation, 6.25× acceleration on Qwen-Image with 20.2% quality improvement over TaylorSeer, and 5× acceleration on HunyuanVideo with a 24.7% PSNR gain.
Leveraging Multispectral Sensors for Color Correction in Mobile Cameras: This paper proposes a unified end-to-end color correction framework that jointly fuses data from a high-resolution RGB sensor and an auxiliary low-resolution multispectral (MS) sensor, integrating illuminant estimation, illuminant compensation, and color space conversion into a single model. The proposed approach reduces color error (\(\Delta E_{00}\)) by up to 50% compared to RGB-only and MS-only baselines.
Low-Resolution Editing is All You Need for High-Resolution Editing: ScaleEdit is the first work to formally define the high-resolution image editing task. It learns a 1×1 convolutional transfer function in the intermediate feature space of a pretrained generative model to inject fine-grained textural details from the source image, and employs a Blended-Tweedie-based patch synchronization strategy to ensure global consistency. Operating entirely via test-time optimization, the method achieves high-quality editing at 2K and even 8K resolution.
LumiCtrl: Learning Illuminant Prompts for Lighting Control in Personalized Text-to-Image Models: This paper identifies a semantic gap in T2I model text encoders that prevents understanding of standard lighting terminology (e.g., tungsten, 6500K), and proposes LumiCtrl, which learns illumination prompts via three components — physics-based lighting augmentation, edge-guided prompt disentanglement, and masked reconstruction loss — enabling precise text-guided lighting control while preserving subject identity.
MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness: This paper proposes the MAGIC framework, which fine-tunes an inpainting diffusion model and incorporates three complementary modules—Gaussian prompt perturbation, mask-guided spatial noise injection, and context-aware mask alignment—to generate high-fidelity, diverse, and spatially plausible industrial anomaly images under few-shot conditions, achieving state-of-the-art performance on downstream tasks using MVTec-AD.
Match-and-Fuse: Consistent Generation from Unstructured Image Sets: Match-and-Fuse is proposed as the first training-free consistent generation method for unstructured image sets. Images are treated as nodes and image pairs as edges to construct a pairwise consistency graph. Multi-view Feature Fusion (MFF) and feature guidance are employed to manipulate internal features during diffusion inference, achieving set-level cross-image consistency with a DINO-MatchSim of 0.80, substantially outperforming all baselines.
Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping: This paper proposes DiT-BlockSkip, a framework that reduces LoRA fine-tuning memory on FLUX by approximately 50% while maintaining comparable personalized generation quality. It achieves this through two components: timestep-aware dynamic patch sampling (low-resolution training with dynamically adjusted crop sizes) and a block skipping strategy that identifies critical blocks via cross-attention analysis and precomputes residual features for skipped blocks.
MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models: This paper proposes MICON-Bench, a multi-image context generation benchmark covering 6 tasks (1,043 cases), paired with an MLLM-driven Evaluation-by-Checkpoint automated assessment framework. It further introduces DAR (Dynamic Attention Rebalancing), a training-free mechanism that improves generation consistency and quality in unified multimodal models (UMMs) by dynamically adjusting attention weights at inference time.
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation: This paper proposes Mixture of States (MoS)—a multimodal fusion paradigm based on learnable token-level sparse routing—enabling visual tokens to adaptively select hidden states from arbitrary layers of a text encoder at each denoising step. With only 3–5B parameters, MoS matches or surpasses models at the 20B scale.
MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing: MorphAny3D is proposed as the first training-free 3D morphing framework based on Structured Latent (SLAT) representations. It achieves state-of-the-art quality in cross-category 3D morphing through Morphing Cross-Attention (MCA) for structurally coherent source/target fusion, Temporal-Fused Self-Attention (TFSA) for temporal consistency, and a direction correction strategy to eliminate abrupt orientation jumps.
MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification: The paper proposes the MOS framework to address optical-SAR cross-modal ship re-identification. It comprises two core modules: (1) MCRL, which reduces the modality gap during training via SAR image denoising and a category-level modality alignment loss; and (2) CDGF, which generates pseudo-SAR samples from optical images using a Brownian bridge diffusion model at inference time and fuses the resulting features. On the HOSS ReID dataset, MOS achieves a +16.4% R1 improvement in the SAR→Optical direction.
MPDiT: Multi-Patch Global-to-Local Transformer Architecture for Efficient Flow Matching: This paper proposes MPDiT, a multi-scale patch global-to-local diffusion Transformer architecture. Early layers process global context using large patches (4×4) with only 64 tokens, and later layers upsample to small patches (2×2) with 256 tokens for local detail refinement. This reduces GFLOPs by up to 50%, while the XL model achieves FID 2.05 (with CFG) at only 240 training epochs.
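The token counts quoted for MPDiT follow directly from the patch sizes if one assumes a 32×32 latent grid. That grid size is an assumption for illustration and is not stated in the note above. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def num_tokens(latent_side: int, patch_size: int) -> int:
    """Number of non-overlapping square patches covering a square latent grid."""
    if latent_side % patch_size != 0:
        raise ValueError("patch size must divide the latent side length")
    per_side = latent_side // patch_size
    return per_side * per_side

LATENT_SIDE = 32  # assumed latent resolution, not given in the summary

coarse_tokens = num_tokens(LATENT_SIDE, 4)  # large 4x4 patches in early layers
fine_tokens = num_tokens(LATENT_SIDE, 2)    # small 2x2 patches in later layers
print(coarse_tokens, fine_tokens)  # prints "64 256"
```

Since self-attention cost grows quadratically with token count, running the early layers on 64 tokens instead of 256 is where most of the compute saving would come from.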
MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation: This paper introduces MultiBanana, the first large-scale benchmark for systematically evaluating multi-reference image generation. It comprises 3,769 evaluation samples with up to 8 reference images across 5 difficulty dimensions (including cross-domain, scale mismatch, rare concepts, and multilingual settings). The benchmark reveals complementary failure modes: closed-source models tend to overfit reference details, while open-source models tend to ignore reference subjects.
Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models: This paper proposes NLCE, a training-free three-stage concept erasure framework for text-to-image diffusion models. It achieves precise localized erasure of target concepts through spectrally-weighted representation modulation, attention-guided spatial gating, and gated feature scrubbing, while explicitly preserving semantically neighboring concepts. NLCE outperforms existing methods on Oxford Flowers, Stanford Dogs, celebrity identity, and sensitive content erasure benchmarks.
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models: This paper reinterprets SDE-based GRPO as distance optimization / contrastive learning, and proposes Neighbor GRPO — which completely bypasses SDE conversion by constructing neighborhood candidate trajectories through perturbation of ODE initial noise, combined with a softmax distance surrogate policy for policy gradient optimization, while preserving all advantages of deterministic ODE sampling.
OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution: This paper proposes OARS, a framework that systematically addresses human preference alignment in generative real-world image super-resolution for the first time. It introduces COMPASS, an MLLM-based process-aware reward model, and a progressive online reinforcement learning pipeline (cold start → reference-guided RL → non-reference RL), significantly improving perceptual quality while preserving fidelity.
OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution: This paper proposes OARS, a framework that aligns generative real-world image super-resolution models with human visual preferences via COMPASS—an MLLM-based process-aware reward model—and a progressive online reinforcement learning pipeline, achieving adaptive balance between perceptual quality and fidelity.
Object-WIPER: Training-Free Object and Associated Effect Removal in Videos: This paper presents Object-WIPER, the first training-free framework for removing objects and their associated visual effects (shadows, reflections, mirror images, etc.) in videos. It leverages text-visual cross-attention and visual self-attention within DiT to localize associated effect regions, achieves clean removal via foreground re-initialization and attention scaling, and introduces the TokSim metric along with WIPER-Bench, a real-world benchmark.
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers: This paper proposes ELIT (Elastic Latent Interface Transformer), which decouples computation from input resolution by inserting variable-length latent token interfaces and lightweight Read/Write cross-attention layers into DiTs. A single model supports multiple inference budgets, achieving 35.3% and 39.6% improvements in FID and FDD respectively on ImageNet-1K 512px.
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers: This paper proposes ELIT (Elastic Latent Interface Transformer), which inserts variable-length latent interfaces and lightweight Read/Write cross-attention layers into DiT, enabling a single model to dynamically adjust its computational budget at inference time while non-uniformly allocating computation to more difficult image regions, achieving up to 53% FID reduction on ImageNet 512px.
OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery: OpenDPR proposes a training-free, vision-centric framework that leverages diffusion models to offline generate diverse visual prototypes for target categories, and performs open-vocabulary change detection in remote sensing imagery via similarity-based retrieval in visual feature space at inference time, achieving state-of-the-art performance on four benchmark datasets.
OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation: This paper proposes OPRO, a parameter-efficient adaptation method based on orthogonal matrices. By applying learnable panel-specific orthogonal operators to position-aware queries and keys of a frozen backbone, OPRO explicitly modulates cross-panel attention interactions while preserving the pre-trained same-panel synthesis behavior. With only 0.93M additional parameters, it significantly improves the editing quality of multiple state-of-the-art methods on MagicBrush.
Organizing Unstructured Image Collections using Natural Language: This paper introduces a new task, Open Semantic Multi-Clustering (OpenSMC), and proposes the X-Cluster framework, which converts images into text via an MLLM and subsequently employs an LLM to automatically discover clustering criteria and semantic substructures. Without any human-specified priors, the framework organizes large-scale unlabeled image collections into multi-dimensional, multi-granularity, and interpretable semantic clusters.
PhysGen: Physically Grounded 3D Shape Generation for Industrial Design: This paper proposes PhysGen, a unified framework that integrates physical constraints (aerodynamic efficiency) into 3D shape generation. It jointly encodes geometric and physical information into a unified latent space via a Shape-and-Physics VAE (SP-VAE), and employs a Flow Matching model with alternating updates between velocity steps and physics refinement to generate 3D shapes that are both visually plausible and physically efficient (e.g., automobiles with low drag coefficients).
Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction: This paper proposes ReMD (Residual-Multigrid Diffusion), which embeds multigrid residual correction into each reverse sampling step of a diffusion model. By leveraging a multi-wavelet basis to construct a cross-scale hierarchy, ReMD achieves physics-consistent and efficient fluid super-resolution without requiring explicit PDEs.
Pixel Motion Diffusion Is What We Need for Robot Control: DAWN proposes a two-stage fully diffusion-based framework — a Motion Director that generates dense pixel motion fields as interpretable intermediate representations, and an Action Expert that converts these fields into executable robot action sequences — achieving SOTA on CALVIN (Avg Len 4.00), MetaWorld (Overall 65.4%), and real-world benchmarks, with substantially smaller model capacity and training data than competing methods.
PixelDiT: Pixel Diffusion Transformers for Image Generation: PixelDiT proposes a fully Transformer-based dual-level pixel-space diffusion model: a patch-level DiT captures global semantics while a pixel-level DiT refines textural details, achieving an FID of 1.61 on ImageNet without any VAE, and enabling direct text-to-image training at 1024-resolution in pixel space.
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion: This paper proposes PixelRush, a training-free high-resolution image generation framework that combines four components — partial inversion, few-step diffusion models, Gaussian filter blending, and noise injection — to compress 4K image generation time from several minutes to approximately 20 seconds (10×–35× speedup), while surpassing existing SOTA methods on FID/IS metrics.
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion: PixelRush is the first method to bring training-free high-resolution image generation into practical deployment. By truncating the reverse diffusion process via partial DDIM inversion to skip redundant low-frequency reconstruction steps, it enables few-step diffusion models to function within a patch-based refinement pipeline. Combined with Gaussian filter blending and noise injection to eliminate artifacts, the method generates 2K images in 4 seconds and 4K images in 20 seconds—10–35× faster than the state of the art while achieving superior FID.
Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers: This paper proposes the PPCL framework, which employs linear probing and first-order CKA difference analysis to detect contiguous redundant layer intervals in MMDiT, combined with non-sequential distillation to enable depth pruning (plug-and-play) and width pruning (replacing text streams/FFNs with linear projections). The approach compresses Qwen-Image from 20B to 10B with only a 3.29% performance drop.
Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification: Pose-dIVE leverages the SMPL model to jointly control human body pose and camera viewpoint, using a diffusion model to generate person images with diversified poses and viewpoints. This approach systematically alleviates distributional bias in Re-ID training data, consistently improving the generalization capability of arbitrary Re-ID models across multiple benchmarks.
PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation: This paper introduces PosterIQ, a comprehensive benchmark for poster design evaluation, comprising 7,765 understanding annotations and 822 generation prompts across 24 task categories — including OCR, font perception, layout reasoning, design intent understanding, and compositionally-aware generation — to systematically diagnose the gap between current MLLMs and diffusion models in design cognition.
Precise Object and Effect Removal with Adaptive Target-Aware Attention: This paper proposes ObjectClear, a framework that decouples foreground removal from background reconstruction via Adaptive Target-Aware Attention (ATA), combined with Attention-Guided Fusion (AGF) and Spatially Varying Denoising Strength (SVDS) strategies, enabling precise removal of target objects along with their associated visual effects such as shadows and reflections. The work also introduces OBER, the first large-scale dataset for Object-Effect Removal.
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality: This paper proposes LivingSwap, the first video reference-guided face swapping model. Through a controllable pipeline of keyframe identity injection, source video reference completion, and temporal stitching, it achieves high-fidelity face swapping in long videos. The method stably injects the target identity while preserving expression, lighting, and motion details from the source video, reducing manual editing effort by 40×.
Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models: This work systematically probes affordance capabilities in vision foundation models (VFMs), revealing that DINO encodes part-level geometric structure while Flux encodes verb-conditioned interaction priors. Fusing the two without any training, the method achieves zero-shot affordance estimation competitive with weakly supervised approaches.
PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On: PROMO is a virtual try-on framework built on Flow Matching DiT that significantly reduces inference overhead while maintaining high fidelity. It achieves this through latent multimodal condition concatenation, a temporal self-reference caching mechanism, and 3D-RoPE grouped condition injection. The framework supports multi-garment try-on and text-prompt-controlled outfit styling.
PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On: PROMO is built on a FLUX.1-dev Flow Matching DiT backbone and achieves high-fidelity, efficient multi-garment virtual try-on without a traditional reference network, by combining latent-space multimodal condition concatenation, temporal self-reference KV caching, 3D-RoPE grouped conditioning, and a fine-tuned VLM style-prompt system. Inference is 2.4× faster than the non-accelerated baseline, and the method surpasses existing VTON and general image-editing approaches on VITON-HD and DressCode.
Prototype-Guided Concept Erasure in Diffusion Models: To address the difficulty of thoroughly erasing broad concepts (e.g., violence, nudity) from diffusion models, this paper proposes a training-free erasure method based on concept prototypes. The method clusters concept-differential directions in the CLIP embedding space to obtain image-space prototypes, optimizes these into a text prototype space via cosine similarity, and at inference time selects the best-matching prototype as a negative guidance signal to suppress target concepts in a classifier-free guidance fashion.
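The inference-time step described above (pick the best-matching prototype, then use it as a negative guidance signal) can be sketched as follows. This is a simplified illustration under stated assumptions: embeddings and noise predictions are plain lists, the function names are hypothetical, and the guidance formula is the common negative-prompt variant of classifier-free guidance rather than the paper's exact rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_prototype(prompt_emb, prototypes):
    """Index of the prototype most similar to the prompt embedding."""
    return max(range(len(prototypes)),
               key=lambda i: cosine(prompt_emb, prototypes[i]))

def guided_noise(eps_cond, eps_neg, scale):
    """Steer away from the negative prediction toward the conditional one."""
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_neg)]

# Toy 3-D text prototypes for one broad concept (values are made up).
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
prompt_emb = [0.5, 0.9, 0.1]

best = select_prototype(prompt_emb, prototypes)
eps = guided_noise(eps_cond=[1.0, 2.0], eps_neg=[0.5, 1.0], scale=2.0)
print(best, eps)
```

In a real pipeline `eps_neg` would be the denoiser's prediction conditioned on the selected prototype; here it is a fixed toy vector so the arithmetic stays visible.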
PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow: This paper presents PSDesigner, an automated graphic design system that simulates the creative workflow of human designers. It operates through three collaborative modules — AssetCollector (resource collection), GraphicPlanner (tool-call planning), and ToolExecutor (PSD operation execution) — and is trained on CreativePSD, the first PSD-format design dataset, enabling the system to learn professional design workflows and directly generate editable PSD files.
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards: To address poor subject consistency and insufficient text adherence in multi-subject personalized image generation, this paper proposes a scalable multi-subject data construction pipeline and Pairwise Subject-Consistency Rewards (PSR). Through two-stage training (SFT + RL), the method comprehensively outperforms existing state-of-the-art methods on the self-constructed PSRBench.
PureCC: Pure Learning for Text-to-Image Concept Customization: PureCC introduces a decoupled learning objective that separates "target concept implicit guidance" from "original condition prediction," coupled with a dual-branch training pipeline comprising a frozen representation extractor and a trainable flow model, along with adaptive guidance scaling \(\lambda^{\star}\) derived from projection error. This enables high-fidelity concept customization while minimizing disruption to the original model's behavior and capabilities.
Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge: This paper proposes the QUAD framework, which treats LoRA weights as runtime inputs rather than compiling them into the model graph. Combined with a distillation fine-tuning strategy that shares quantization parameters across LoRAs, QUAD enables a single compiled model to dynamically switch among multiple GenAI tasks on mobile NPUs, achieving 6× memory compression and 4× latency improvement.
RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment: This paper proposes RAISE, a framework that models T2I generation as a requirement-driven adaptive evolutionary process. A requirement analyzer decomposes prompts into structured checklists; multi-action mutations (prompt rewriting + noise resampling + instruction-based editing) evolve candidate populations in parallel; tool-augmented visual verification eliminates non-compliant candidates each round. The result is adaptive inference-time scaling that achieves a state-of-the-art score of 0.94 on GenEval while reducing generated samples by 30–40% and VLM calls by 80% compared to reflection fine-tuning baselines.
RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models: This paper proposes RAZOR, a ratio-aware multi-layer/multi-head selective editing framework that enables efficient and precise targeted unlearning in Transformer-based vision models such as CLIP, Stable Diffusion, and VLMs, while preserving overall model performance and quantization robustness.
RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models: RAZOR selects the most critical layers and attention heads via ratio-aware gradient scoring that jointly measures forgetting pressure and retention alignment, and achieves precise, efficient targeted unlearning on CLIP, Stable Diffusion, and VLMs through a three-component constrained loss and an iterative expansion mechanism, with no performance degradation after quantization.
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark: This paper introduces RealUnify, the first benchmark specifically designed to evaluate the bidirectional synergy between understanding and generation capabilities in unified models. Through 1,000 manually annotated instances and a dual evaluation protocol (direct and stepwise), it reveals that current unified models, despite possessing both understanding and generation capabilities, still fail to achieve genuine capability synergy in end-to-end scenarios.
Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning: This paper proposes MVC-ZigAL, a framework that improves single-view fidelity and cross-view consistency in few-step text-to-multiview diffusion models through multiview-aware MDP formulation, zigzag self-refining advantage learning, and Lagrangian dual constrained optimization.
Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing: This paper identifies that relational distances between image patch pairs remain invariant under AI editing, and exploits this invariance to build Rel-Zero, a zero-watermarking framework that achieves robust content authentication against diverse generative edits without modifying the original image.
RenderFlow: Single-Step Neural Rendering via Flow Matching: RenderFlow recasts neural rendering as a single-step conditional flow matching problem from albedo to full-illumination images. Using G-buffers as conditions and a pretrained video DiT as the backbone, it achieves deterministic rendering more than 10× faster than diffusion-based methods (~0.19 s/frame). An optional sparse keyframe guidance module further improves physical accuracy, and inverse rendering is supported via a frozen backbone with lightweight adapters.
Resolving the Identity Crisis in Text-to-Image Generation: This paper identifies the "identity crisis" in text-to-image models for multi-person scene generation — manifesting as duplicated faces and identity merging — and proposes the DisCo framework. By combining a compositional reward function with GRPO-based reinforcement learning fine-tuning of a flow-matching model, DisCo achieves 98.6% unique face accuracy, surpassing closed-source models including GPT-Image-1.
Reviving ConvNeXt for Efficient Convolutional Diffusion Models: This paper proposes FCDM (Fully Convolutional Diffusion Model), which adapts the ConvNeXt architecture as a backbone for conditional diffusion models. Using only 50% of DiT-XL's FLOPs, FCDM achieves a competitive FID of 2.03 on ImageNet and can train an XL-scale model on 4× RTX 4090 GPUs, suggesting that the efficiency of fully convolutional architectures in generative modeling has been severely underestimated.
RewardFlow: Generate Images by Optimizing What You Reward: RewardFlow proposes an inversion-free inference-time framework that fuses multiple differentiable reward signals—including semantic alignment, perceptual fidelity, local grounding, object consistency, and human preference—via multi-reward Langevin dynamics, achieving state-of-the-art editing fidelity and compositional alignment on image editing and compositional generation benchmarks.
Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring: Score2Instruct proposes SIG, a fully automated video quality instruction generation pipeline that requires neither human annotation nor closed-source APIs. By automatically evaluating 14 quality dimensions and aggregating them into comprehensive quality reasoning texts via hierarchical CoT, SIG constructs the S2I dataset (320K+ instruction samples). Combined with a two-stage progressive fine-tuning strategy, multiple video LMMs simultaneously acquire quality scoring and quality reasoning capabilities, achieving an average SRCC improvement of 26–31% across 5 VQA benchmarks.
SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models: This paper proposes SeaCache, a training-free dynamic caching strategy based on a Spectral-Evolution-Aware (SEA) filter. By separating signal and noise components in the frequency domain to measure inter-timestep redundancy, SeaCache significantly improves the latency–quality trade-off in diffusion model inference.
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models: This paper proposes SegQuant, a framework that achieves high-fidelity post-training quantization of diffusion models through two novel components: SegLinear, a semantics-aware segmented quantization scheme based on static computational graph analysis, and DualScale, a hardware-native dual-scale polarity-preserving quantization scheme. The approach is cross-architecture generalizable and compatible with deployment pipelines, requiring neither handcrafted rules nor runtime dynamic information.
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models: This paper proposes SegQuant, a deployment-oriented post-training quantization (PTQ) framework for diffusion models. It achieves cross-architecture, high-fidelity W8A8/W4A8 quantization on SD3.5, FLUX, and SDXL via semantics-aware segmented quantization (SegLinear) based on static computational graph analysis and hardware-native dual-scale polarity-preserving quantization (DualScale), while maintaining compatibility with industrial inference engines such as TensorRT.
Self-Corrected Image Generation with Explainable Latent Rewards: This paper proposes xLARD, a framework that performs semantic self-correction in the latent space during text-to-image generation via a lightweight residual corrector. Guided by explainable latent reward signals (counting, color, position), xLARD achieves +4.1% on GenEval and +2.97% on DPGBench, and adapts to multiple backbones in a plug-and-play manner.
SHOE: Semantic HOI Open-Vocabulary Evaluation Metric: This paper proposes SHOE, an evaluation framework that decomposes HOI predictions into verb and object components and computes LLM-driven semantic similarity scores for each independently, replacing the exact-match paradigm of conventional mAP. SHOE achieves 85.73% agreement with human judgments on open-vocabulary HOI detection evaluation, surpassing the average inter-annotator agreement of 78.61%.
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement: ShowTable introduces a novel task termed creative table visualization (generating infographics from structured data tables) and proposes a progressive self-correction pipeline in which an MLLM (for reasoning and reflection) and a diffusion model (for generation and refinement) collaborate iteratively. Through a dedicated fine-tuned rewriting module and an RL-optimized refinement module, the framework consistently and substantially improves visualization quality over all baseline models on the newly constructed TableVisBench benchmark.
SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images: This paper proposes SimLBR, which regularizes a detector by blending a small amount of fake image information into real image embeddings within the DINOv3 latent space, compelling the model to learn a compact decision boundary around the real image distribution. This design achieves strong generalization to unseen generators, attaining 94.54% average accuracy on GenImage and outperforming AIDE on the challenging Chameleon benchmark by 25% in accuracy and 70% in recall.
SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation: This paper analyzes the bottleneck of severely skewed acceptance-length distributions in Speculative Jacobi Decoding (SJD) for text-to-image generation, and proposes the SJD-PAC framework. By introducing two techniques—Proactive Drafting (PD) and Adaptive Continuation (AC)—SJD-PAC achieves a strictly lossless 3.8× inference speedup, substantially surpassing the ~2× acceleration of vanilla SJD.
SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking: SLICE decomposes image semantics into four factors (subject / environment / action / detail), anchors each factor to a distinct spatial partition of the diffusion model's initial noise, and thereby enables fine-grained, semantic-aware watermarking—capable of not only detecting tampering but also precisely localizing which semantic factor has been altered, entirely without training.
SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking: SLICE is a semantic watermarking framework that decomposes image semantics into four factors—subject, environment, action, and detail—and binds each factor to a distinct spatial partition of the initial Gaussian noise. This enables a three-state verification mechanism that not only detects watermark presence but also localizes semantic tampering. Against the strongest CSI attack, SLICE achieves an attack success rate (ASR) of only 19%, compared to 81% for SEAL.
Smoothing the Score Function for Generalization in Diffusion Models: An Optimization-based Explanation Framework: This paper theoretically demonstrates that memorization in diffusion models stems from the "sharpness" of empirical score function weights (concentration of softmax weights), and proposes two methods — noise unconditioning and temperature smoothing — that improve generalization and reduce memorization by smoothing score function weights while preserving generation quality.
SOLACE: Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards: SOLACE uses a T2I model's intrinsic denoising self-confidence (i.e., the accuracy with which it recovers injected noise) as an internal reward signal to replace external reward models in post-training, achieving consistent improvements in compositional generation, text rendering, and text-image alignment. The signal is also complementary to external rewards and can mitigate reward hacking.
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning: This paper proposes Spatial-SSRL, a self-supervised reinforcement learning paradigm that automatically constructs five pretext tasks (patch reordering, flip recognition, cropped patch inpainting, depth ordering, and relative 3D position prediction) from standard RGB/RGB-D images. By optimizing LVLMs with GRPO, the method achieves average improvements of 3.89%–4.63% across seven spatial benchmarks without any human annotation or external tools.
SPDMark: Selective Parameter Displacement for Robust Video Watermarking: SPDMark proposes a video diffusion model watermarking framework based on Selective Parameter Displacement (SPD). By learning a low-rank basis shift dictionary in the decoder and selecting combinations according to the watermark key, it achieves per-frame watermark embedding with imperceptibility, high robustness, and low computational overhead, while supporting temporal tampering detection and localization.
StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars: This paper proposes a two-stage autoregressive adaptation framework (autoregressive distillation + adversarial refinement) that converts a bidirectional human video diffusion model into a real-time streaming generator. Reference Sink, RAPR positional re-encoding, and a consistency-aware discriminator are introduced to ensure long-video stability, realizing the first full-body real-time digital human that supports both speaking and listening interactions.
TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts: To address the severe task interference problem in unified image generation and editing models, this paper proposes the TAG-MoE framework. By introducing a hierarchical task semantic annotation scheme and a predictive alignment regularization, TAG-MoE injects high-level task intent into local MoE routing decisions, transforming the gating network from a task-agnostic executor into a semantics-aware dispatcher. The method achieves the best overall open-source performance across five benchmarks including ICE-Bench, EmuEdit, GEdit, and DreamBench++.
Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning: This paper proposes D2-Align, a framework that learns a directional correction vector in the reward model's embedding space to debias reward signals, addressing preference mode collapse (PMC) in RLHF-aligned diffusion models — a phenomenon where over-optimization of rewards leads to severe degradation in generation diversity. DivGenBench is also introduced as a benchmark for quantitative diversity evaluation.
Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models: This paper identifies that β-VAE tokenizers in latent diffusion models suffer from variance collapse, producing an overly compact latent space that is highly sensitive to diffusion sampling perturbations. The proposed Variance Expansion (VE) Loss achieves adaptive latent variance regulation through an adversarial balance between reconstruction and variance expansion objectives, consistently improving generation quality (FID 1.18) across multiple diffusion architectures.
Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework: This paper proposes AC-DC, a three-stage denoiser (Auto-Correction + Directional Correction + Score Denoising) that addresses the manifold mismatch between ADMM iterations and the score training manifold. It provides the first convergence guarantee for ADMM-PnP combined with score-based denoisers, achieving state-of-the-art performance across multiple inverse problems.
Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework: This paper proposes ADMM-PnP with an AC-DC denoiser, which integrates diffusion priors into the ADMM primal-dual framework via a three-stage correct-then-denoise procedure (Auto-Correction + Directional Correction + score-based denoising). The method addresses the geometric mismatch between ADMM iterates and the diffusion training manifold, establishes convergence guarantees under two sets of conditions, and consistently outperforms baselines such as DAPS, DPS, and DiffPIR across seven inverse problems.
Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control: WorldForge proposes a fully training-free inference-time guidance framework that adapts pretrained video diffusion models into precise camera-trajectory-controllable 3D/4D generation tools via three synergistic components—Intra-Step Recursive Refinement (IRR), Flow-Gated Latent Fusion (FLF), and Dual-Path Self-Corrective Guidance (DSG)—simultaneously surpassing both training-based and inference-based baselines in trajectory accuracy and perceptual quality.
TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration: This paper proposes TAP, a framework that uses a first-layer probe to adaptively select the optimal predictor (from a Taylor expansion family) for each token at each step, enabling training-free diffusion model acceleration with a 6.24× speedup on FLUX.1-dev without perceptible quality degradation.
TAUE: Training-free Noise Transplant and Cultivation Diffusion Model: TAUE proposes a training-free layered image generation framework that "transplants" intermediate denoising latents into the initial noise of a new generation process, combined with cross-layer attention sharing, to achieve consistent three-layer generation of foreground, background, and composite images — matching or surpassing fine-tuning-based methods.
TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration: This paper proposes TC-Padé, a feature residual prediction framework based on Padé rational function approximation. Through adaptive coefficient modulation and a stage-aware strategy, TC-Padé achieves trajectory-consistent acceleration in low-step (20–30 steps) diffusion sampling scenarios (2.88× on FLUX.1-dev, 1.72× on Wan2.1), significantly outperforming existing Taylor-expansion-based methods.
Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling: This paper proposes Composer, a plug-and-play meta-generator framework that dynamically generates low-rank parameter updates from each input condition at inference time and injects them into pretrained model weights, achieving instance-specific adaptive generation with negligible computational overhead (+0.2% time, +3.6% memory). The framework consistently improves performance across class-conditional generation, text-to-image synthesis, post-training quantization, and test-time scaling scenarios.
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering: This paper proposes TextPecker—a plug-and-play structural anomaly-aware RL strategy that constructs a character-level structural anomaly annotation dataset to train a structure-aware recognizer, replacing the noisy reward signals of conventional OCR. By jointly optimizing semantic alignment and structural fidelity, TextPecker achieves significant improvements in visual text rendering quality across multiple text-to-image models (FLUX, SD3.5, Qwen-Image).
The Universal Normal Embedding: This paper proposes the Universal Normal Embedding (UNE) hypothesis: the latent spaces of generative models (diffusion models) and visual encoders (CLIP, DINO) share an approximately Gaussian underlying geometric structure, and both can be viewed as noisy linear projections of this shared space. The hypothesis is validated through the NoiseZoo dataset and extensive experiments, and the paper demonstrates the feasibility of direct linear semantic editing in the DDIM inversion noise space.
TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models: This paper proposes TINA (Text-free INversion Attack), which bypasses all text-based concept erasure defenses by optimizing DDIM inversion under null-text conditioning to recover a precise initial noise vector. The work demonstrates that existing erasure methods merely sever the text-to-image mapping without truly deleting the visual knowledge encoded in model parameters.
Tiny Inference-Time Scaling with Latent Verifiers: This paper proposes VHS (Verifier on Hidden States), a verifier that operates directly on the intermediate hidden states of a DiT generator, bypassing the decode–re-encode overhead. In the inference-time scaling setting for single-step image generation, VHS reduces joint generation-verification latency by 63.3% and FLOPs by 51%, while achieving a 2.7% performance gain on GenEval under the same time budget.
TokenLight: Precise Lighting Control in Images using Attribute Tokens: TokenLight formulates image relighting as an end-to-end image generation task conditioned on attribute tokens (intensity, color, ambient light, diffuse level, and 3D light source position), enabling precise, continuous, and interpretable lighting control within a diffusion Transformer framework.
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity: To address the problem of T2I models generating images that appear "too vivid to be real," this work proposes the Color Fidelity Dataset (CFD, 1.3M images), the Color Fidelity Metric (CFM, based on Qwen2-VL + softrank loss), and Color Fidelity Refinement (CFR, a training-free spatiotemporal adaptive guidance modulation scheme), forming an integrated evaluation-and-improvement framework.
Towards Robust Content Watermarking Against Removal and Forgery Attacks: This paper proposes ISTS, an instance-specific two-sided detection watermarking method that dynamically selects watermark injection timestep and location based on image semantics to resist both removal and forgery attacks. A two-sided detection mechanism is further designed to counter reverse latent representation attacks. ISTS achieves state-of-the-art robustness under both average and worst-case scenarios across three removal attacks and three forgery attacks.
TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking: This paper proposes TRACE, a document watermarking framework based on character structure encoding. It leverages a diffusion model (DragDiffusion) to precisely displace skeleton keypoints of characters for information embedding. Through three core components—Adaptive Diffusion Initialization (ADI), Guided Diffusion Encoding (GDE), and Masked Region Replacement (MRR)—TRACE simultaneously achieves cross-media robustness, multi-language/multi-font generalizability, and high visual imperceptibility.
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods: This paper proposes STALL, a training-free zero-shot generated video detector that jointly models per-frame spatial likelihoods and inter-frame temporal likelihoods in a whitened embedding space. It requires only real video calibration and achieves robust detection across diverse generative models.
TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection: This paper proposes TriDF — the first benchmark that comprehensively evaluates interpretable DeepFake detection across three dimensions: Perception, Detection, and Hallucination. It comprises 55K high-quality samples covering 16 DeepFake types and 3 modalities, and reveals a triadic coupling relationship in which accurate perception is a prerequisite for reliable detection, yet hallucination can severely undermine decision-making.
Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation: This paper proposes Uni-DAD, the first method to unify diffusion model distillation and domain adaptation into a single-stage pipeline. Through a dual-domain DMD loss and a multi-head GAN loss, Uni-DAD achieves high-quality and diverse generation in few-shot target domains using only 1–4 sampling steps.
Unified Vector Floorplan Generation via Markup Representation: This paper proposes the Floorplan Markup Language (FML), which encodes floorplan elements such as rooms and doors into structured token sequences. A LLaMA-style Transformer model (FMLM) trained on this representation unifies unconditional, boundary-conditioned, graph-conditioned, and completion tasks within a single framework, achieving over 80% lower FID than HouseDiffusion.
V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration: This paper reformulates image restoration as a progressive video generation process. By leveraging the rich visual priors of a pretrained video model (Wan2.2-TI2V-5B), the proposed method achieves versatile all-in-one restoration across multiple degradation types using only 1,000 multi-task training samples (less than 2% of the data used by existing methods), surpassing specialized architectures trained on million-scale datasets.
VeCoR — Velocity Contrastive Regularization for Flow Matching: This paper proposes VeCoR (Velocity Contrastive Regularization), which introduces a "negative velocity" contrastive signal into standard Flow Matching training. By simultaneously guiding the model on "where to go" and "where not to go," VeCoR achieves more stable trajectory evolution and higher perceptual fidelity—yielding relative FID reductions of 22% and 35% for SiT-XL/2 and REPA-SiT-XL/2, respectively, on ImageNet-1K.
Verify Claimed Text-to-Image Models via Boundary-Aware Prompt Optimization: BPO proposes a reference-free white-box T2I model verification method that employs a three-stage pipeline (adversarial anchor identification → binary search boundary exploration → target optimization) to locate model-specific semantic boundary regions. The generated verification prompts achieve an average accuracy of 96% and F1 of 0.93 across 5 T2I models, while being 2× faster than the TVN baseline.
ViHOI: Human-Object Interaction Synthesis with Visual Priors: This paper proposes ViHOI, a plug-and-play framework that leverages a VLM to extract decoupled visual and textual priors from 2D reference images, compresses them into compact condition tokens via Q-Former, and injects them into a diffusion model to enhance HOI motion generation quality. At inference time, a text-to-image model synthesizes reference images to enable strong generalization to unseen objects.
Vinedresser3D: Agentic Text-guided 3D Editing: This paper presents Vinedresser3D, a 3D editing agent centered on a multimodal large language model (MLLM) that requires no user-provided 3D masks. The system automatically interprets editing intent, localizes editing regions, generates multimodal guidance, and performs inversion-based inpainting in the latent space of a native 3D generative model (Trellis), enabling high-quality text-guided 3D asset editing.
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization: ViStoryBench constructs a comprehensive benchmark comprising 80 multi-style stories, 344 characters, and 1,317 shots, and proposes 12 automated evaluation metrics covering character consistency, style similarity, prompt alignment, and copy-paste detection. The benchmark systematically evaluates over 25 open-source and commercial story visualization methods, addressing the lack of unified evaluation standards in this field.
VOSR: A Vision-Only Generative Model for Image Super-Resolution: This paper proposes VOSR and is the first work to demonstrate that a purely vision-trained generative super-resolution model can match or even surpass methods built on pretrained T2I models. By leveraging visual semantic conditioning and a restoration-oriented guidance strategy, VOSR achieves high-quality SR at approximately 1/10 the training cost of T2I-based approaches.
WaDi: Weight Direction-aware Distillation for One-step Image Synthesis: By decomposing weight changes during distillation into norm and direction components, this work finds that directional change is the primary driver of distillation (with a magnitude 22× larger than norm change). It proposes LoRaD (Low-Rank Weight Direction Rotation) adapters, integrated into the VSD framework to form WaDi, achieving state-of-the-art one-step FID on COCO with only ~10% trainable parameters.
When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization: This paper exposes the "identity collapse" bottleneck in multi-subject personalization and proposes the DINOv2-based Subject Collapse Rate (SCR) metric to replace the inadequate CLIP-I. Three SOTA models (MOSAIC, XVerse, PSR) already reach ~50% SCR at 2 subjects, surging to ~97% at 10 subjects. The paper also constructs a systematic benchmark covering 2–10 subjects × 3 scene types.
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance: This paper proposes Conflict-aware Adaptive Safety Guidance (CASG), a training-free plug-and-play framework that resolves safety degradation caused by directional conflicts in multi-category aggregation. CASG dynamically identifies the harmful category most aligned with the current generation state and applies safety guidance exclusively along that direction.
When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm: This paper presents a systematic comparative analysis of safety risks between MLLMs (Multimodal Large Language Models) and diffusion models. It finds that MLLMs are more prone to generating unsafe images because their superior semantic understanding lets them interpret abstract and non-English prompts, and that MLLM-generated images are harder for existing fake image detectors to detect. Even when detectors are fine-tuned specifically for MLLMs, detection can be circumvented by enriching prompt details.
WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval: This paper proposes WISER, a training-free zero-shot composed image retrieval (ZS-CIR) framework that unifies T2I and I2I dual-path retrieval through an iterative "retrieve–verify–refine" loop. A VLM verifier explicitly models intent-awareness and uncertainty-awareness to enable adaptive fusion and structured self-reflective refinement. WISER achieves a relative improvement of 45% on CIRCO mAP@5 and 57% on CIRR Recall@1, surpassing many supervised methods.
YOEO: You Only Erase Once - Erasing Anything without Bringing Unexpected Content: YOEO proposes a single-pass erasure framework that distills a multi-step diffusion model into a few-step model for efficient inference. It introduces a Sundries Suppression Loss (which detects newly generated spurious objects via entity segmentation) and an Entity Feature Coherence Loss (which ensures semantic consistency between the erased region and its surroundings), addressing the hallucination problem of diffusion models in object erasure.