
🎨 Image Generation

🎞️ ECCV2024 · 117 paper notes


🔥 Top topics: Diffusion Models ×54 · Text-to-Image ×13 · Super-Resolution ×7 · Image Editing ×7 · Personalized Generation ×7

2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction: 2S-ODIS utilizes a pre-trained VQGAN (without fine-tuning) to synthesize panoramic images via a two-stage architecture: the first stage generates a low-resolution coarse ERP image, and the second stage corrects geometric distortions by generating and fusing 26 NFoV perspective images. This reduces training time from 14 days to 4 days while achieving superior image quality.
A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks: Proposes IF-GMI, which decomposes the generator of a pre-trained StyleGAN2 into multiple blocks and optimizes intermediate features layer by layer (with an \(\ell_1\)-ball constraint to prevent image collapse). This expands the search space of model inversion attacks from the latent space to intermediate features, boosting attack accuracy in OOD scenarios by up to 38.8%.
A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control: Controllable generation of 3D multi-tissue coronary artery segmentation maps is achieved using Latent Diffusion Models (LDM). Topological interaction loss ensures anatomical plausibility, and decoupled control over cross-sectional morphology and branch structure is obtained through dual-channel morpho-skeletal conditioning. Additionally, Adaptive Null Guidance (ANG) is proposed to efficiently enhance conditional fidelity using a non-differentiable regressor, ultimately supporting counterfactual anatomical editing for finite element simulation.
A High-Quality Robust Diffusion Framework for Corrupted Dataset: This paper proposes the RDUOT framework, which integrates Unbalanced Optimal Transport (UOT) into a diffusion model (DDGAN) for the first time. By learning \(q(x_0|x_t)\) instead of \(q(x_{t-1}|x_t)\), it effectively filters outliers in training data, achieving robust generation on corrupted datasets while outperforming the DDGAN baseline on clean datasets.
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation: This paper proposes AccDiffusion, which decouples global text prompts into patch-level content-aware prompts (utilizing cross-attention maps to determine whether each word belongs to a specific patch) and introduces dilated sampling with window interaction to improve global consistency. Without requiring extra training, this approach effectively solves the object duplication issue in patch-wise high-resolution image generation, achieving high-quality, duplication-free image extrapolation from 2K to 4K resolutions on SDXL.
AdaDiffSR: Adaptive Region-Aware Dynamic Acceleration Diffusion Model for Real-World Image Super-Resolution: Observing that the required denoising steps for different image regions in diffusion-based super-resolution vary significantly (background regions converge early while foreground textures still need iterations), this work proposes a dynamic step-skipping strategy based on Multi-Metric Latent Entropy (MMLE) to perceive information gain. Sub-regions are categorized into stable, growth, and saturated types, each assigned different step sizes. Concurrently, a Progressive Feature Injection (PFJ) module is developed to balance fidelity and realism. On datasets such as DRealSR, this approach achieves reconstruction quality comparable to StableSR while reducing inference time and FLOPs by 1.5\(\times\) and 2.7\(\times\), respectively.
AdaGen: Learning Adaptive Policy for Image Synthesis: This paper unifies step-level parameter scheduling (temperature, mask ratio, CFG scale, timestep, etc.) of multi-step generative models (MaskGIT/AR/Diffusion/Rectified Flow) as an MDP. A lightweight RL policy network is used to achieve sample-adaptive scheduling, and an adversarial reward design is proposed to prevent policy overfitting, consistently improving performance across four generative paradigms (e.g., VAR FID \(1.92 \rightarrow 1.59\), and reducing the inference cost of DiT-XL by 3x with superior performance).
AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation: This work proposes AdaNAT, which models the generation policy configuration of Non-Autoregressive Transformers (NAT) as an MDP. Utilizing a lightweight policy network combined with PPO reinforcement learning and an adversarial reward model, AdaNAT automatically customizes generation policies (re-masking ratio, sampling temperature, CFG weights, etc.) for each sample. It achieves an FID of 2.86 on ImageNet-256 using only 8 steps, yielding an approximate 40% relative improvement over hand-crafted policies.
AFreeCA: Annotation-Free Counting for All: By leveraging Stable Diffusion to generate synthetic sorting/counting data, this work implements a two-stage strategy of learning sorting before anchoring counts, combined with density-guided image partitioning. This enables the first annotation-free counting method applicable to objects of arbitrary categories, outperforming existing unsupervised methods in crowd counting.
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation: Proposes AnyControl, which supports arbitrary combinations of multiple spatial control signals (depth, edge, segmentation, pose) via a Multi-Control Encoder featuring an alternating fusion and alignment block structure, outperforming existing methods on the COCO multi-control benchmark with an FID of 44.28.
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation: This work analyzes the semantic leakage between subjects caused by cross-attention and self-attention in diffusion models, and proposes the Bounded Attention mechanism. By restricting the information flow between different subjects during denoising, it generates semantically independent multi-subject images and enables training-free generation of 5+ semantically similar subjects.
Beta-Tuned Timestep Diffusion Model: This paper provides an in-depth theoretical analysis of the forward process in diffusion models, revealing that distribution changes are most drastic in the early stages. Consequently, the authors propose B-TTDM (Beta-Tuned Timestep Diffusion Model), which replaces the uniform distribution with a Beta distribution for timestep sampling to better align training with the characteristics of the forward diffusion process, validating its effectiveness across multiple benchmark datasets.
Bridging the Gap: Studio-Like Avatar Creation from a Monocular Phone Capture: This work proposes a method to generate studio-quality facial texture maps from monocular phone videos, combining the \(W^+\) space parameterization of StyleGAN2 and diffusion-model-based super-resolution to bridge the gap from smartphone scans to high-quality 3D avatars.
ByteEdit: Boost, Comply and Accelerate Generative Image Editing: This work proposes ByteEdit, a framework that introduces human feedback learning into generative image editing (inpainting/outpainting). It improves editing quality through three reward models targeting aesthetics, alignment, and coherence, and accelerates inference utilizing adversarial training and progressive strategies.
Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning: This work proposes a method to identify "worst-case forget sets" from an adversarial perspective. It uses a bi-level optimization framework to find the hardest-to-forget data subsets, and leverages SignSGD to simplify the second-order BLO into a first-order problem, thereby more reliably evaluating the true efficacy of machine unlearning methods.
COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation: This paper proposes the COIN method, which simultaneously estimates high-quality global human motion and camera motion from monocular dynamic camera video through an improved version of Score Distillation Sampling via Control-Inpainting, combined with joint human-scene relationship losses.
Collaborative Control for Geometry-Conditioned PBR Image Generation: Proposes the Collaborative Control paradigm, which freezes a pre-trained RGB diffusion model and trains a parallel PBR model. By utilizing bi-directional cross-network communication layers to jointly model the RGB and PBR image distributions, it achieves high-quality geometry-conditioned PBR material image generation under limited data conditions.
ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement: The paper proposes ColorPeel, a method that learns a color prompt token on basic geometric shapes of target colors (disentangling color and shape) and introduces a cross-attention alignment loss, enabling T2I diffusion models to accurately generate objects with user-specified RGB colors.
Controlling the World by Sleight of Hand: Proposes CosHand, which uses binary hand masks as action conditions and fine-tunes on pretrained Stable Diffusion to predict future images after hand-object interaction, showing zero-shot generalization capabilities to robotic end-effectors.
DCDM: Diffusion-Conditioned-Diffusion Model for Scene Text Image Super-Resolution: Proposes DCDM (Diffusion-Conditioned-Diffusion Model), which learns the distribution of high-resolution scene text images through a dual-diffusion architecture. The first latent diffusion model generates character-level text embeddings as conditioning, while the second diffusion model generates high-resolution text images guided jointly by this condition and the low-resolution image, outperforming state-of-the-art methods on the TextZoom and Real-CE datasets.
Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers: This paper proposes Diff-Tracker, which is the first to leverage the rich visual semantic knowledge embedded in pre-trained text-to-image diffusion models (Stable Diffusion) for unsupervised object tracking. It achieves continuous tracking by learning a prompt representing the target and updating it online.
DiffiT: Diffusion Vision Transformers for Image Generation: DiffiT (Diffusion Vision Transformer) is proposed, which introduces a Time-dependent Multi-head Self-Attention (TMSA) mechanism to dynamically adjust self-attention behaviors at different stages of the denoising process, achieving a state-of-the-art (SOTA) FID score of 1.73 on ImageNet-256 with 16-20% fewer parameters than DiT/MDT.
Diffusion-based Image-to-Image Translation by Noise Correction via Prompt Interpolation: This paper proposes PIC (Prompt Interpolation-based Correction), a training-free image-to-image translation method for diffusion models. By constructing a noise correction term through progressive prompt embedding interpolation and linearly combining it with the noise prediction of the source image, PIC achieves structure-preserving, high-fidelity image editing with an inference time (18.1s) shorter than that of all baseline methods.
Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning: This work proposes the DDDR framework, which is the first to introduce pre-trained diffusion models into Federated Class Continual Learning (FCCL). Through Federated Class Inversion, it learns a compact class embedding for each category, uses the diffusion model to replay historical data at high quality to combat catastrophic forgetting, and employs contrastive learning to bridge the domain gap between generated and real data.
Distilling Diffusion Models into Conditional GANs: Proposes the Diffusion2GAN framework, which distills multi-step diffusion models into single-step conditional GANs. The core innovations are the E-LatentLPIPS latent-space perceptual loss and a multi-scale conditional discriminator based on pretrained diffusion models, achieving performance that surpasses DMD, SDXL-Turbo, and SDXL-Lightning on the zero-shot COCO benchmark.
DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment: This paper proposes DreamDiffusion, which leverages temporal masked signal modeling for large-scale pre-training of an EEG encoder to learn robust brainwave representations. It then aligns the EEG-text-image space using additional supervision from a CLIP image encoder, and finally utilizes a pre-trained Stable Diffusion model to generate high-quality images directly from EEG signals, achieving portable and low-cost "thoughts-to-image" generation.
DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators: This work proposes DreamDrone, a zero-shot, training-free perpetual view generation pipeline. By directly warping the intermediate latent codes of a pretrained diffusion model (rather than performing image-level warping) and combining feature-correspondence guidance with a high-pass filtering strategy, DreamDrone synthesizes high-quality, geometrically consistent, and unbounded scenes.
DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion: DreamMover is proposed to perform image interpolation between image pairs with large motions based on pre-trained text-to-image diffusion models. By utilizing three core components—diffusion-aware optical flow estimation, two-level latent space fusion, and self-attention concatenation and replacement—it generates semantically consistent intermediate frames.
EBDM: Exemplar-guided Image Translation with Brownian-bridge Diffusion Models: This paper proposes the EBDM framework, which models exemplar-guided image translation as a stochastic Brownian-bridge diffusion process, directly translating structural controls into realistic images. By integrating a Global Encoder, an Exemplar Network, and an Exemplar Attention Module, the framework effectively incorporates both the global style and detailed texture information of the exemplar image.
EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion: Proposes EchoScene, a 3D indoor scene generation method based on a dual-branch diffusion model. Through an "Information Echo" mechanism, multiple denoising processes exchange information collaboratively during scene graph diffusion, generating globally consistent and interactively controllable scenes.
Editable Image Elements for Controllable Synthesis: This work proposes an "Editable Image Elements" representation that decomposes an input image into a set of semantically aligned patch embeddings (similar to superpixels). Each patch is associated with spatial position and size attributes. Users can directly edit these attributes (moving, scaling, deleting), and a Stable Diffusion-based decoder then synthesizes realistic images from them.
EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation: EMDM is proposed to capture complex denoising distributions under large step sizes through a conditional denoising diffusion GAN. It enables real-time generation of high-quality human motions with no more than 10 sampling steps, improving inference speed by approximately 200 times compared to MDM.
Enhancing Diffusion Models with Text-Encoder Reinforcement Learning: This paper proposes TexForce, which utilizes reinforcement learning (DDPO) combined with LoRA to fine-tune the text encoder of diffusion models, thereby improving text-image alignment and visual quality. It can be seamlessly combined with existing U-Net fine-tuning methods to achieve superior performance.
Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models: StableVSR is proposed, marking the first application of diffusion models to video super-resolution. By introducing a Temporal Conditioning Module (TCM) and a frame-wise bidirectional sampling strategy, it significantly enhances perceptual quality while ensuring temporal consistency across frames.
Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing: By theoretically analyzing the role of the \(\eta\) parameter in the DDIM sampling equation, this work designs time- and region-dependent \(\eta\) functions to achieve more flexible and precise real image editing.
FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction: This work proposes FineMatch, a benchmark that defines the task of aspect-based, fine-grained image-text mismatch detection and correction. It contains 49,906 high-quality, human-annotated image-text pairs and demonstrates the limitations of existing VLMs in fine-grained compositional understanding.
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis: Proposes FouriScale, which takes a frequency-domain view and replaces convolutional layers in pre-trained diffusion models with dilated convolutions and low-pass filtering, achieving training-free high-resolution image generation at arbitrary sizes. The authors also prove theoretically that dilated convolutions are effective at maintaining structural consistency.
FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior: FreeCompose is proposed, leveraging the generative prior of pretrained diffusion models to achieve generic zero-shot image composition. It unifies image harmonization (appearance editing) and semantic image composition (semantic editing) under a single framework without any extra training.
FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models: Revisiting the image editing process of diffusion models from a frequency perspective, this work reveals that the denoising network preferentially restores low-frequency components, leading to a misalignment between editing guidance and the target region. The authors propose progressive frequency truncation (FreeDiff) to refine guidance signals in frequency space, achieving tuning-free, general image editing.
GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections: To address fine-grained semantic misalignment (component quantities, positions, and mutual relationships) in text-to-garment image generation, this work proposes GarmentAligner. It obtains spatial-quantitative information via an automatic component extraction pipeline and integrates retrieval-augmented contrastive learning with multi-level correction losses to achieve precise alignment of garment components at visual, spatial, and quantitative levels.
Generating 3D House Wireframes with Semantics: A 3D house wireframe generation method based on autoregressive models is proposed. It employs a unified wire representation instead of traditional separate vertex-edge modeling, generating semantically rich wireframe structures via semantically aware BFS sequence ordering and a two-stage coarse-to-fine Transformer decoder, which can be automatically segmented into semantic components like walls, roofs, and rooms.
Generating Human Interaction Motions in Scenes with Text Control: TeSMo is proposed as a text-controlled, scene-aware motion generation method. By pre-training a text-to-motion diffusion model on large-scale motion data and fine-tuning it with an enhanced scene-sensing branch, it generates realistic motion sequences of characters navigating obstacles and interacting with objects (e.g., sitting down) in 3D scenes in two stages (navigation + interaction).
Getting it Right: Improving Spatial Consistency in Text-to-Image Models: This work systematically investigates why text-to-image models struggle to generate spatial relationships. Finding that existing vision-language datasets severely lack spatial descriptions, the authors construct the SPRIGHT dataset (~6 million images re-captioned with spatial relations). Fine-tuning on fewer than 500 multi-object images achieves SOTA on the T2I-CompBench spatial score (0.2133), a 41% improvement over the baseline.
Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation: Proposes the Prompt Pose Matching (PPM) framework, which leverages the rich knowledge in pre-trained text-to-image diffusion models to address Category-Agnostic Pose Estimation (CAPE). By learning pseudo prompts corresponding to keypoints, it achieves few-shot keypoint detection without training on base categories.
HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects: This work proposes HIMO, the first large-scale full-body human-multi-object interaction 4D MoCap dataset (3.3K sequences, 4.08M frames) accompanied by detailed textual descriptions and temporal segment annotations. It also presents a dual-branch conditional diffusion model and an autoregressive pipeline to generate coordinated multi-object interaction motion sequences.
HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation: This paper proposes HybridBooth, a two-stage hybrid prompt inversion framework. By first generating an initial word embedding using a regressor (Probe) and then performing residual fine-tuning (Refinement), it achieves efficient subject-driven personalized image generation in only 3-5 iteration steps.
Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition: This paper proposes Idempotent Generative Models (IGM), theoretically establishing the equivalence between generative models and maximum entropy coding (spectral contrastive learning). By imposing idempotent constraints on the feature space of skeleton data, the features of the generative model become more compact and suitable for recognition tasks, improving the accuracy on NTU 60 xsub from 84.6% to 86.2%.
Implicit Concept Removal of Diffusion Models: The Geom-Erasing method is proposed, which leverages external classifiers/detectors to provide the existence and geometric location of implicit concepts. These are encoded as location tokens in the text conditioning and used as negative prompts, effectively eliminating the generation of "implicit concepts" such as watermarks and unsafe content in diffusion models. It achieves SOTA performance on both I2P and custom ICD benchmarks.
Implicit Style-Content Separation using B-LoRA: This paper proposes B-LoRA. By analyzing the SDXL architecture, it is discovered that implicitly separating style and content of a single image can be achieved by jointly training the LoRA weights of only two specific transformer blocks (Block 4 controls content, and Block 5 controls style), supporting various tasks such as style transfer, text-based stylization, and consistent style generation.
Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm: Infinite-ID is proposed to separate identity information and text semantic information via an ID-semantic decoupling paradigm. In the training phase, text cross-attention is disabled to focus on learning identity embeddings. In the inference phase, the two streams of information are merged via a mixed attention mechanism and an AdaIN-mean operation, achieving both high-fidelity identity preservation and semantic consistency with a single reference image.
∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions: Proposes ∞-Brush, the first conditional diffusion model in infinite-dimensional function space. By introducing a cross-attention neural operator, it achieves controllable conditional generation. Trained on only 0.4% of pixels, it can generate large images maintaining global layout consistency at arbitrary resolutions up to 4096×4096.
IRGen: Generative Modeling for Image Retrieval: Redefines image retrieval as a generative modeling task, proposing IRGen—a sequence-to-sequence model that converts images into short sequences of discrete semantic tokens via a semantic image tokenizer, and then autoregressively generates the identifier of the query image's nearest neighbor, achieving end-to-end differentiable retrieval and reaching state-of-the-art (SOTA) performance across three standard benchmarks.
L-DiffER: Single Image Reflection Removal with Language-Based Diffusion Model: Proposes L-DiffER, a language-guided diffusion model that addresses the issue of inaccurate control conditions through an iterative condition refinement strategy. It integrates a multi-condition constraint mechanism to ensure the color and structural fidelity of image restoration, while preserving the generative capability of diffusion models to handle low-transmission reflections.
Latent Guard: A Safety Framework for Text-to-Image Generation: This paper proposes the Latent Guard framework, which learns a latent space on top of the text encoder of T2I models. Through contrastive learning, it maps blacklist concepts and input prompts containing these concepts to nearby locations, achieving highly efficient unsafe prompt detection (ID Explicit AUC 0.985) and allowing flexible updates of the blacklist during test time without retraining.
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer: Unifying the object detection framework DETR with generative models (GAN/VAE), this paper proposes LayoutDETR for automatic graphic layout design under multimodal conditions. It is constrained by background images and driven by foreground image-text elements, achieving state-of-the-art (SOTA) performance in ad banner and UI layout generation.
Lazy Diffusion Transformer for Interactive Image Editing: Proposes LazyDiffusion, an asymmetric encoder-decoder Transformer architecture that compresses global information via a context encoder and executes diffusion denoising only on the masked region, achieving a 10× speedup with image quality comparable to full-image generation methods during interactive image editing.
LCM-Lookahead for Encoder-Based Text-to-Image Personalization: This paper proposes utilizing Latent Consistency Model (LCM) as a "shortcut" to enable backpropagation of image-space losses (e.g., identity recognition loss) during the training of diffusion model encoders. Combined with self-attention feature sharing and consistent data generation, this approach significantly enhances identity preservation and prompt alignment in encoder-based facial personalization.
Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation: The DP-SAD framework is proposed to train differentially private diffusion models via stochastic adversarial distillation. It leverages the diffusion model's timesteps to dilute the impact of DP noise, introduces a discriminator to accelerate convergence, and combines the gradient chain rule with DP's post-processing property to reduce the introduction of randomness, achieving SOTA privacy-preserving image generation quality without requiring pre-training.
Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction: This work proposes the Semantic Latent Directions (SLD) method for stochastic human motion prediction. By constructing a set of orthogonal latent base directions and representing future motion hypotheses as linear combinations of these directions, it achieves more accurate, diverse, and semantically controllable motion prediction.
Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality: A missing-modality AVQA framework based on trimodal relations is proposed. It recalls missing modality features via the RMM generator and enhances them cross-modally using the AVR diffusion model, achieving accurate question answering even when audio or visual modality is missing.
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning: This paper proposes LEGO, a model that enhances the action description capability of VLLMs through visual instruction tuning and injects the image/text embeddings of VLLMs as additional conditions into a diffusion model, enabling the generation of action execution frames from an egocentric perspective.
Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models: The Lego method is proposed to achieve the disentanglement and inversion of personalized concepts beyond appearance (such as adjectives and verbs) through subject separation and context loss for personalized content generation in diffusion models.
Linearly Controllable GAN: Unsupervised Feature Categorization and Decomposition for Image Generation and Manipulation: This paper proposes LC-GAN, which achieves unsupervised geometry-appearance feature disentanglement in the GAN latent space through contrastive feature categorization and spectral regularization. This enables independent linear control of various attributes in generated images, achieving SOTA generation quality on FFHQ, CelebA-HQ, and AFHQ-V2.
LivePhoto: Real Image Animation with Text-guided Motion Control: The LivePhoto image animation framework is proposed to address the ambiguity of text-to-motion mapping through a motion intensity estimation module and a text reweighting module. It achieves high-quality video generation based on real images and text descriptions, allowing users additional control over motion intensity.
Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation: GuidedMotion is proposed to guide global motion diffusion generation using local actions as fine-grained control signals. By estimating guidance weights through semantic graph parsing and Graph Attention Networks, it supports continuously adjustable motion control, demonstrating significant advantages in generating complex multi-action motions.
M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models: Proposes M2D2M, which generates multi-segment continuous human motion sequences based on discrete diffusion models, achieving smooth transitions between actions through dynamic transition probabilities and a Two-Phase Sampling (TPS) strategy without requiring additional multi-motion training data.
MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion: This work applies diffusion models to skeleton representation learning for the first time through the Masked Conditional Diffusion (MacDiff) framework. A semantic encoder extracts representations of the masked skeleton, which guide a conditional diffusion decoder for denoising, thereby unifying the discriminative and generative modeling of skeletons.
MagicEraser: Erasing Any Objects via Semantics-Aware Control: Proposes MagicEraser, an object erasure framework based on diffusion models. Through a three-stage design of content initialization, prompt tuning, and semantics-aware attention refocusing, it achieves high-quality object erasure and harmonious background generation without requiring user text inputs.
Memory-Efficient Fine-Tuning for Quantized Diffusion Model: Proposes TuneQDM, the first memory-efficient fine-tuning method for quantized diffusion models. By introducing multi-channel quantization scale updates and a timestep-aware scale strategy, it achieves personalized generation quality on a 4-bit quantized model close to that of the full-precision counterpart.
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed-Precision Quantization: To address the issue that few-step diffusion models (such as 1-step SDXL-Turbo) are harder to quantize than multi-step models, this paper proposes MixDQ, a mixed-precision quantization method. It incorporates BOS-aware text embedding quantization, metric-decoupled sensitivity analysis, and integer programming-based bit allocation. Under W4A8, it increases FID by only 0.5 while achieving 3x model compression and a 1.5x speedup.
MotionChain: Conversational Motion Controllers via Multimodal Prompts: This paper proposes MotionChain, a unified vision-motion-language model that generates continuous, long-term human motion sequences across multi-turn conversations via multimodal prompts, supporting the joint understanding and generation of text, images, and motion.
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model: Proposes MotionLCM, the first method to introduce consistency distillation into human motion generation. Single-step or few-step inference in the motion latent space yields real-time generation (~30ms/sequence), and a Motion ControlNet operating in the same latent space adds real-time controllability.
MultiGen: Zero-Shot Image Generation from Multi-modal Prompts: This paper proposes MultiGen, which constructs an "augmented token" for each object by fusing text, spatial coordinates, and image features. By training coordinate and feature models to handle missing modalities during inference, it achieves the first zero-shot image generation from multi-object multi-modal prompts, supporting flexible inputs of text-only or arbitrary modality combinations.
Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion: Proposes MVSD, a mutual learning framework based on diffusion models that jointly trains visual acoustic matching (VAM) and dereverberation as symmetric, mutually inverse tasks. The framework leverages their reciprocal relationship to overcome paired data scarcity, marking the first application of diffusion models to visually-guided reverberation style transfer.
NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation: Proposes NeuSDFusion, a 3D shape generation framework based on a hybrid tri-plane SDF representation (NeuSDF) and a spatial-aware Transformer autoencoder. By preserving the spatial correspondences among tri-planes, it achieves state-of-the-art (SOTA) performance in tasks such as unconditional generation, multimodal shape completion, single-view reconstruction, and text-to-3D generation.
NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model: This paper proposes NL2Contact, which is the first to leverage natural language descriptions for the controllable modeling of 3D hand-object contact maps. It generates hand poses and contact areas from text using a staged diffusion model, and constructs ContactDescribe, the first hand-object contact dataset with fine-grained linguistic descriptions.
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models: OMG is proposed as an occlusion-friendly personalized multi-concept image generation framework. Through two-stage sampling (layout generation + concept noise blending), it achieves strong identity preservation and natural lighting harmonization. It can be integrated out-of-the-box with various single-concept models (such as LoRA and InstantID) without requiring additional training.
OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model: This work proposes OmniSSR, the first zero-shot omnidirectional image super-resolution method based on diffusion models. By utilizing Octadecahedral Tangent Image Interaction (OTII) and Gradient Decomposition (GD) correction techniques, OmniSSR leverages the image prior of Stable Diffusion to achieve a balance between fidelity and realism, requiring no training or fine-tuning.
PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance: Proposes PanoFree, a tuning-free multi-view image generation method that efficiently generates consistent panoramic images through iterative warp-and-inpaint, cross-view self-guidance, and symmetric bidirectional generation strategies.
Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization: This paper proposes PASD (Pixel-Aware Stable Diffusion), which enables the diffusion model to perceive local image structures at the pixel level through a Pixel-Aware Cross Attention (PACA) module. Combined with a degradation removal module and an adjustable noise schedule, it achieves a unified framework for realistic image super-resolution and personalized stylization, where the style can be switched simply by replacing the base model.
Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos: This paper proposes a method for learning articulated 3D animal motion generative models from unlabeled internet videos. By decomposing videos into static shape, appearance, and motion latent codes via a video photo-geometric autoencoding framework, the method enables the generation of diverse 4D animations from a single image during inference without requiring any pose annotations or parametric shape models.
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning: This work models personalized T2I generation as a Deterministic Policy Gradient (DPG) framework—with the diffusion model acting as the policy and the denoising steps as actions. By introducing a "look forward" mechanism to capture long-term visual consistency and a DINO similarity reward, it improves the DINO score from 0.694 to 0.738 (+6.3%) and CLIP-I from 0.762 to 0.797 (+4.6%) on the DreamBooth benchmark.
Probabilistic Weather Forecasting with Deterministic Guidance-Based Diffusion Model: This paper proposes DGDM (Deterministic Guidance Diffusion Model), which jointly trains a deterministic prediction branch and a Brownian Bridge-based probabilistic-diffusion branch. By utilizing deterministic forecasting results to truncate the reverse diffusion process, the model controls the range of uncertainty while achieving both accurate and probabilistic weather forecasting, reaching SOTA performance in both global and regional forecasting tasks.
Prompting Future Driven Diffusion Model for Hand Motion Prediction: This paper proposes PromptFDDM, a prompt-based future-driven diffusion model for hand motion prediction. By combining a Spatial-Temporal Extractor Network (STEN) with the guidance mechanism of a Ground Truth Extractor Network (GTEN) and a Reference Data Generation Network (RDGN), alongside interactive prompt augmentation, the model achieves SOTA performance in both first-person and third-person hand motion prediction.
Realistic Human Motion Generation with Cross-Diffusion Models: Proposes the CrossDiff framework, which integrates 3D and 2D motion information through a unified encoding and cross-decoding mechanism. It leverages cross-diffusion to capture finer full-body motion details and supports learning 3D motion generation from in-the-wild 2D data.
RegionDrag: Fast Region-Based Image Editing with Diffusion Models: Proposes RegionDrag, a region-based copy-and-paste drag editing method. Replacing point-based drag instructions with region-based ones makes editing over 100x faster, more precise, and clearer in expressing user intent.
Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis: Identifies the misalignment issue between training and testing latent code distributions in IMLE. Proposes RS-IMLE, which alters the training prior distribution via rejection sampling, achieving an average FID reduction of 45.9% across nine few-shot image datasets.
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment: The authors identify distributional discrepancies (such as word frequency differences) between positive and negative captions at the dataset level, and propose using a text-only classifier to filter out biased data. Fine-tuning LLaVA-1.5 with the debiased dataset yields LLaVA-score, a state-of-the-art image-text alignment scoring model.
ReNoise: Real Image Inversion Through Iterative Noising: Proposes ReNoise, an iterative renoising method that improves the image inversion quality of diffusion models. By applying the UNet multiple times at each inversion timestep and averaging the predictions, it improves trajectory estimation accuracy, which is particularly effective for few-step diffusion models (such as SDXL Turbo and LCM).
RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification: This paper deeply analyzes the source of robustness of the Tree-Ring watermarking method (discovering that distribution shift is an unexpected hidden helper in its verification task), reveals its severe limitations in multi-key identification tasks, and proposes RingID—a multi-channel heterogeneous watermarking framework. Through discretization, lossless embedding, and a more circular ring design, RingID improves the identification accuracy for 2048 keys from 0.07 to 0.82.
Robust-Wide: Robust Watermarking against Instruction-driven Image Editing: This paper proposes Robust-Wide, the first robust watermarking method against instruction-driven image editing. The core innovation is the Partial Instruction-driven Denoising Sampling Guidance (PIDSG) module, which opens the gradient flow of the last \(k\) steps of the editing process during training. This forces the watermark to be embedded into semantic-aware areas, achieving a bit error rate (BER) of only about 2.6% for 64-bit watermarks after editing.
RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models: RodinHD is proposed to address the catastrophic forgetting problem of the triplane decoder and achieve high-fidelity 3D avatar generation through hierarchical portrait representation injection.
RPBG: Towards Robust Neural Point-based Graphics in the Wild: To address the lack of robustness of Neural Point-based Graphics (NPBG) in real-world scenarios, this paper proposes RPBG. Through a downgrade-aware convolution module, attention-driven point visibility correction, lightweight background modeling, and point cloud enhancement, RPBG significantly improves the quality and stability of point cloud neural re-rendering across various in-the-wild datasets without modifying the point rasterization pipeline.
SAIR: Learning Semantic-aware Implicit Representation: This paper proposes Semantic-Aware Implicit Representation (SAIR). By constructing two modules, Semantic Implicit Representation (SIR) and Appearance Implicit Representation (AIR), SAIR integrates text-aligned semantic embeddings extracted by CLIP into implicit neural functions. This enables it to significantly outperform methods relying solely on appearance information in image inpainting tasks with large missing regions, achieving a PSNR improvement of 1.65-2.69dB on CelebA-HQ.
Scalable Group Choreography via Variational Phase Manifold Learning: This paper proposes PDVAE (Phase-conditioned Dance VAE), a phase-conditioned variational generative model for scalable group choreography. By learning the phase manifold (amplitude, frequency, offset, phase shift) of dance motion in the frequency domain, it achieves high-quality group dance generation for an arbitrary number of dancers with constant memory consumption, comprehensively outperforming existing methods on the AIOZ-GDance and AIST-M datasets.
ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation: This paper proposes Asynchronous Score Distillation (ASD), which reduces noise prediction error and aligns the distribution of rendered images by shifting diffusion timesteps forward (rather than fine-tuning the diffusion model). This addresses the issue of VSD fine-tuning destroying text comprehension capabilities, thereby achieving stable training and prompt-amortized 3D generator training scalable to 100,000 text prompts.
Shedding More Light on Robust Classifiers under the lens of Energy-based Models: By reinterpreting robust discriminative classifiers as energy-based models (EBMs), this paper reveals the energy dynamics of adversarial training, proposes an energy-weighted adversarial training method (WEAT), and demonstrates the implicit generative capabilities of robust classifiers.
SMooDi: Stylized Motion Diffusion Model: Introduces SMooDi—the first diffusion model that adapts a pre-trained text-to-motion model for stylized motion generation. Through a style adaptor and dual style guidance (classifier-free guidance + classifier-based guidance), it enables diverse stylized motion generation driven by content text and style motion sequences.
Soft Prompt Generation for Domain Generalization: This paper proposes SPG (Soft Prompt Generation), which introduces generative models to VLM prompt learning for the first time. By dynamically generating instance-specific soft prompts from images via CGAN, it stores domain knowledge in the generative model rather than prompt vectors, achieving superior domain generalization performance.
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models: Proposes SPDInv—a source prompt disentangled inversion method. By modeling the inversion process as a fixed-point search problem and solving it using a pretrained diffusion model, the inverted noise map is disentangled from the source prompt, significantly boosting text-driven image editing quality.
Stable Preference: Redefining Training Paradigm of Human Preference Model for Text-to-Image Synthesis: This work redefines the training paradigm of human preference models for text-to-image generation. By introducing a quality-aware margin mechanism and an anti-interference loss function, the authors address two major issues of traditional cross-entropy training: "blind punishment of image pairs with similar quality" and "lack of robustness to visual perturbations," achieving SOTA performance on prevailing human preference datasets.
StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models: This paper proposes StyleTokenizer, which defines image style as a learnable token embedding to control style generation in diffusion models using a single reference image, while accurately separating content and style.
Text2Place: Affordance-aware Text Guided Human Placement: Proposes Text2Place—the first method for realistic human placement guided by text. It optimizes Gaussian-blob-parameterized semantic masks using Score Distillation Sampling (SDS) loss to learn scene affordances, followed by subject-conditioned inpainting for identity-preserving human placement.
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering: TextDiffuser-2 utilizes two language models for layout planning and layout encoding respectively, achieving more flexible, automated, and diverse visual text rendering, significantly enhancing font style diversity while maintaining text accuracy.
Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation: This paper introduces a new task, Logic-Rich Text-to-Image Generation (Logic-Rich T2I), and constructs the Textual-Visual Logic dataset to evaluate models' capability in handling complex relational descriptions. It proposes a baseline model consisting of three core components: a relation understanding module, a multimodality fusion module, and a negative pair discriminator, significantly improving the quality of image generation from complex logical texts.
The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation: Proposes the Realistic-Fantasy Benchmark (RFBench) to evaluate the performance of diffusion models on creative/knowledge-intensive prompts, and designs a training-free RFNet framework that enhances the generation capability of diffusion models for abstract and imaginative concepts through LLM-assisted prompt interpretation and a semantic alignment assessment module.
Toward Tiny and High-quality Facial Makeup with Data Amplify Learning: A Data Amplify Learning (DAL) paradigm is proposed, which leverages a Diffusion-based Data Amplifier (DDA) to "amplify" and generate a large volume of paired training data from only 5 annotated images. This data is used to train the TinyBeauty model with only 80K parameters, achieving SOTA makeup transfer performance at 460fps on an iPhone 13.
Towards Reliable Advertising Image Generation Using Human Feedback: Constructs a million-scale human-annotated advertising image dataset RF1M, proposes a multimodal RFNet to automatically detect the usability of generated images, and designs the Consistent Condition regularization-driven RFFT fine-tuning method, boosting the advertising image availability rate from 56.4% to 85.5%.
UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models: This paper proposes UDiffText, which achieves high-precision and visually harmonious text synthesis in arbitrary images by replacing the CLIP encoder with a lightweight character-level text encoder, fine-tuning cross-attention layers with a local attention loss (based on character segmentation maps) and a Scene Text Recognition (STR) loss, and applying a noised latent refinement step during inference. It outperforms state-of-the-art (SOTA) methods in sequence accuracy (SeqAcc) across various scenarios.
Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement: A universal frequency-disentangled learning paradigm is proposed. By leveraging Laplacian decomposition and low-frequency consistency constraints, it decouples low-frequency (illumination recovery) and high-frequency (denoising) enhancement into two independent sub-tasks. With only 88K additional parameters, it delivers up to 7.68dB PSNR improvement across 6 SOTA low-light enhancement models.
WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation: This work proposes a new task called Web Rendering Parameters Generation (WebRPG), which aims to automatically generate visual presentation parameters (layout, text style, and color) of web elements based on HTML code. By using a VAE to compress the rendering parameters and custom HTML embeddings to capture semantic and hierarchical information, two baseline models (autoregressive and diffusion) are established, where the autoregressive model significantly outperforms the diffusion model and GPT-4.
WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models: WildVidFit is a video virtual try-on framework that needs no video data for training. By utilizing an image-based conditional diffusion model and a diffusion guidance module (VideoMAE + DINO-V2), it achieves temporally consistent garment try-on effects in complex in-the-wild videos.
XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution: XPSR proposes utilizing high-level and low-level semantic descriptions generated by a Multimodal Large Language Model (LLaVA) as cross-modal priors. These priors are integrated into a diffusion model via Semantic-Fusion Attention, combined with a Degradation-Free Constraint to extract semantic-preserving features, achieving high-fidelity and highly realistic image super-resolution.
You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation: This work proposes YONOS-SR, which trains a Stable Diffusion-based super-resolution model via a Scale Distillation strategy. It achieves state-of-the-art results with only a single DDIM step, accelerating inference by 200 times compared to conventional methods.
Zero-Shot Detection of AI-Generated Images: This paper proposes ZED (Zero-shot Entropy-based Detector), which estimates the probability distribution of each pixel given its context using a lossless image encoder. By using the "level of surprise of an image to a real-image model" as the discriminative feature, it detects images generated by various generators without any AI-generated training data, improving the average accuracy by over 3% compared to the SOTA across a wide range of generative models.
ZigMa: A DiT-style Zigzag Mamba Diffusion Model: ZigMa is a DiT-style Zigzag Mamba diffusion model. By employing a heterogeneous layer-wise zigzag scanning scheme, it maintains spatial continuity and achieves superior generation quality compared to Mamba baselines with zero parameter or memory overhead, while retaining the linear complexity advantage over Transformers.
ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs: ZipLoRA proposes a cheap and efficient LoRA merging method. By learning column-wise merging coefficients and minimizing the cosine similarity between columns, it achieves hyperparameter-free merging of independently trained subject LoRAs and style LoRAs, generating personalized "any subject × any style" images in diffusion models.
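
As a rough illustration of the column-wise merging idea in the ZipLoRA entry above, the sketch below scales each column of two weight updates by its own coefficient and sums them, and computes a cosine similarity of the kind that could serve as a penalty. It is a toy under stated assumptions, not the authors' implementation: the function names `merge_columns` and `cosine_similarity` are hypothetical, the matrices are plain nested lists, no coefficients are actually learned, and the choice to compare the two coefficient vectors (rather than some other pair of vectors) is an assumption.

```python
import math


def merge_columns(delta_subject, delta_style, m_subject, m_style):
    """Merge two weight updates with one coefficient per column.

    merged[r][c] = m_subject[c] * delta_subject[r][c]
                 + m_style[c]   * delta_style[r][c]

    All names here are hypothetical and chosen for illustration only.
    """
    return [
        [
            m_subject[c] * row_a[c] + m_style[c] * row_b[c]
            for c in range(len(row_a))
        ]
        for row_a, row_b in zip(delta_subject, delta_style)
    ]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy 2x2 "subject" and "style" updates (made-up numbers).
    subject = [[1.0, 2.0], [3.0, 4.0]]
    style = [[10.0, 20.0], [30.0, 40.0]]

    # Coefficients that keep column 0 from the subject update
    # and column 1 from the style update.
    m_subject = [1.0, 0.0]
    m_style = [0.0, 1.0]

    print(merge_columns(subject, style, m_subject, m_style))
    # Assumed penalty term: similarity between the two coefficient vectors.
    print(abs(cosine_similarity(m_subject, m_style)))
```

In this toy setting, coefficient vectors that select disjoint columns have zero cosine similarity, which gives an intuition for why a similarity penalty would push the two updates toward using different columns.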