🎨 Image Generation
🤖 AAAI2026 · 78 paper notes
- AEDR: Training-Free AI-Generated Image Attribution via Autoencoder Double-Reconstruction
-
This paper proposes a training-free image attribution method based on the ratio of autoencoder double-reconstruction losses. By incorporating image uniformity calibration to eliminate texture complexity bias, the method achieves an average accuracy of 95.1% across 8 mainstream diffusion models, surpassing the strongest baseline by 24.7%, while being approximately 100× faster.
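The double-reconstruction idea can be sketched in a few lines. This is an illustrative simplification, not the paper's exact score: `autoencode` stands in for one candidate model's autoencoder, and the uniformity calibration and attribution rule are omitted.

```python
import numpy as np

def double_reconstruction_ratio(x, autoencode, eps=1e-12):
    """Illustrative AEDR-style score: ratio of the second-pass
    reconstruction loss to the first-pass loss. Images that passed
    through a generator's own autoencoder tend to reconstruct almost
    losslessly on the second pass, so the ratio separates matched
    from mismatched candidate models."""
    x1 = autoencode(x)                   # first reconstruction pass
    x2 = autoencode(x1)                  # second reconstruction pass
    l1 = float(np.mean((x1 - x) ** 2))   # first-pass loss
    l2 = float(np.mean((x2 - x1) ** 2))  # second-pass loss
    return l2 / (l1 + eps)
```

Attribution would then pick the candidate autoencoder whose score deviates most from the calibrated baseline; being training-free, the whole pipeline needs only forward passes.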
- Aggregating Diverse Cue Experts for AI-Generated Image Detection
-
This paper proposes the Multi-Cue Aggregation Network (MCAN), which unifies three complementary cues — raw image, high-frequency representation, and a newly introduced Chromaticity Inconsistency (CI) — through a Mixture-of-Encoder Adapter (MoEA), enabling robust AI-generated image detection that generalizes across diverse generative models.
- Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation
-
This paper proposes Cool-SD, a theoretically grounded annealed relaxation framework for speculative decoding. By deriving a tight upper bound on the TV distance, it obtains the optimal resampling distribution and proves that a decreasing acceptance probability schedule yields smaller distributional shift than a uniform schedule. Cool-SD achieves a superior speed–quality trade-off over LANTERN++ on LlamaGen and Lumina-mGPT.
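For context, the rule that Cool-SD relaxes is the standard lossless speculative-sampling step, sketched below; the annealed acceptance schedule and the paper's optimal resampling distribution are not shown.

```python
import numpy as np

def speculative_step(p, q, draft_token, rng):
    """Vanilla speculative-decoding acceptance test: accept the draft
    model's token with probability min(1, p/q) under the target
    distribution p and draft distribution q; otherwise resample from
    the normalized residual max(p - q, 0). Cool-SD's annealed
    relaxation modifies this acceptance probability over steps."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return int(draft_token)
    residual = np.maximum(p - q, 0.0)    # mass where the target exceeds the draft
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))
```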
- AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer
-
This work formulates zero-shot anomaly generation as a text-guided localized style transfer problem. A lightweight U-Net trained with CLIP-based losses stylizes masked regions of normal images into semantically aligned anomalous images. With only 263M total parameters (0.61M trainable), AnoStyler surpasses diffusion-based baselines on MVTec-AD and VisA while significantly improving downstream anomaly detection performance.
- Backdoors in Conditional Diffusion: Threats to Responsible Synthetic Data Pipelines
-
This paper exposes a backdoor vulnerability in the ControlNet conditional branch: injecting as little as 1–5% poisoned data suffices to implant a backdoor without modifying the diffusion backbone. Upon trigger activation, the model ignores text prompts and generates attacker-specified content. Clean fine-tuning (CFT) is proposed as a practical defense.
- Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images
-
This paper identifies a novel threat of NSFW text embedded in diffusion-model-generated images, proposes NSFW-Intervention — a targeted LoRA fine-tuning method applied to text-rendering layers — and releases the ToxicBench benchmark.
- Beyond Semantic Features: Pixel-Level Mapping for Generalized AI-Generated Image Detection
-
This paper proposes a pixel-level mapping preprocessing method that suppresses low-frequency semantic bias and enhances high-frequency generation artifacts by breaking the monotonic ordering of pixel values, achieving a cross-model generalization accuracy of 98.4% in AI-generated image detection.
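One way to break the monotonic ordering of pixel values while losing no per-pixel information is a bijective lookup table; the mapping below (multiplication by an odd constant mod 256) is an assumed stand-in for the paper's mapping function.

```python
import numpy as np

def pixel_remap(img, multiplier=37):
    """Illustrative non-monotonic pixel-value mapping: multiply each
    8-bit value by an odd constant modulo 256. Because gcd(37, 256) = 1
    this is a bijection on [0, 255], so no information is lost, but the
    intensity ordering is scrambled, suppressing low-frequency semantic
    content while preserving high-frequency generation artifacts."""
    assert img.dtype == np.uint8
    lut = (np.arange(256) * multiplier % 256).astype(np.uint8)  # bijective LUT
    return lut[img]
```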
- Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra
-
This paper proposes GLMR, a two-stage framework (contrastive pre-retrieval + generative language model re-ranking) that transforms cross-modal retrieval into unimodal retrieval by generating molecular structures aligned with input mass spectra, achieving over 40% improvement in Recall@1 on MassSpecGym.
- CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement
-
CAD-VAE introduces a correlation-aware latent code to capture shared information between target and sensitive attributes, achieves disentanglement by directly minimizing conditional mutual information, and employs a relevance-driven optimization strategy to precisely regulate the shared code, attaining state-of-the-art performance on fair representation learning, counterfactual generation, and fair image editing.
- CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images
-
CausalCLIP is proposed to disentangle CLIP features into causal and non-causal subspaces via Gumbel-Softmax masking and HSIC constraints, combined with adversarial masking and counterfactual intervention to preserve stable forensic cues, achieving a 6.83% accuracy improvement in cross-generator generalization.
- Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition
-
This paper proposes CD3T, a two-level hierarchical MARL framework that employs a conditional diffusion model to learn action semantic representations \(z_a^i\) (conditioned on observations and other agents' actions to predict next observations and rewards), obtains subtask partitions via k-means clustering, and uses a high-level subtask selector combined with a low-level policy operating over a restricted action space. CD3T significantly outperforms all baselines on Super Hard scenarios in SMAC.
- Constrained Particle Seeking: Solving Diffusion Inverse Problems with Just Forward Passes
-
This paper proposes Constrained Particle Seeking (CPS), a gradient-free method for solving diffusion model inverse problems. CPS constructs a locally linear surrogate of the forward process using all candidate particles, then seeks the optimal particle under a hyperspherical constraint within the high-density region of the transition kernel, achieving performance competitive with gradient-based methods.
- Continuous Degradation Modeling via Latent Flow Matching for Real-World Super-Resolution
-
DegFlow is proposed to learn continuous degradation trajectories from discrete-scale real HR-LR pairs via a residual autoencoder and latent space Flow Matching. Given only a single HR image at inference, the model synthesizes realistic LR images at arbitrary continuous scales for training super-resolution models, achieving state-of-the-art performance.
- Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy
-
This paper formalizes copyright infringement from the perspective of Differential Privacy (DP), and proposes the D-Plus-Minus (DPM) framework. By fine-tuning diffusion models in two opposing directions—"learning" and "unlearning"—DPM measures conditional sensitivity differences to perform post-hoc detection of copyright infringement in text-to-image models.
- CountSteer: Steering Attention for Object Counting in Diffusion Models
-
This paper proposes CountSteer, a training-free inference-time method that injects adaptive steering vectors into the cross-attention hidden states of diffusion models, improving object counting accuracy by approximately 4% without degrading image quality.
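Inference-time activation steering of this kind can be sketched minimally as below; the adaptive computation of the steering vector and the choice of layers are the paper's contribution and are not reproduced here.

```python
import numpy as np

def steer_hidden(h, v, alpha=0.5):
    """Minimal activation-steering sketch: add a scaled, normalized
    steering direction v to cross-attention hidden states h, with the
    step size proportional to each token's hidden-state norm so the
    intervention stays in scale with the activations."""
    v = v / (np.linalg.norm(v) + 1e-8)
    return h + alpha * np.linalg.norm(h, axis=-1, keepdims=True) * v
```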
- Creating Blank Canvas Against AI-Enabled Image Forgery
-
This paper proposes a "blank canvas" mechanism that applies adversarial perturbations to make SAM "see nothing" in protected images. When a protected image is tampered with, the tampered regions disrupt the perturbations and become automatically detectable by SAM, enabling proactive tampering localization without requiring any tampered training data.
- DICE: Distilling Classifier-Free Guidance into Text Embeddings
-
This paper proposes DICE, which trains a lightweight sharpener with only 2M parameters to distill the guidance effect of CFG into text embeddings, enabling guidance-free sampling to achieve generation quality on par with CFG while halving inference computation. The method is comprehensively validated across multiple variants of SD1.5, SDXL, and PixArt-α, and is accepted as an AAAI 2026 Oral presentation.
- Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation
-
This paper proposes Diff-V2M, a hierarchical conditional diffusion Transformer framework for video-to-music generation that integrates affective, semantic, and rhythmic features via explicit rhythmic modeling (low-resolution ODF) and a hierarchical cross-attention mechanism, achieving state-of-the-art performance on both in-domain and out-of-domain datasets.
- DIFFA: Large Language Diffusion Models Can Listen and Understand
-
This paper proposes DIFFA — the first large audio-language model built upon a diffusion language model — which combines a frozen LLaDA-8B backbone with a lightweight dual-adapter architecture and a two-stage training pipeline. Using only 960 hours of ASR data and 127 hours of synthetic instruction data, DIFFA achieves competitive performance against autoregressive baselines on MMSU, MMAU, and VoiceBench.
- Difficulty Controlled Diffusion Model for Synthesizing Effective Training Data
-
A difficulty encoder (MLP taking class label and difficulty score as input) is incorporated into Stable Diffusion, with LoRA fine-tuning used to decouple the objectives of "domain alignment" and "difficulty control," enabling controllable learning difficulty in synthesized data. Using only 10% additional synthetic data, the proposed method surpasses the best results of Real-Fake while saving 63.4 GPU hours.
- Diffusion Reconstruction-Based Data Likelihood Estimation for Core-Set Selection
-
This paper proposes using the partial reverse denoising reconstruction bias of diffusion models as a theoretically grounded approximation of data likelihood, combined with information bottleneck theory for optimal reconstruction timestep selection, enabling distribution-aware core-set selection that achieves near-full-dataset training performance on ImageNet with only 50% of the data.
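The reconstruction-bias likelihood proxy can be sketched as follows; the noising schedule, the `denoise` interface, and the single fixed timestep are assumptions standing in for the paper's information-bottleneck-selected timestep.

```python
import numpy as np

def reconstruction_bias(x, denoise, t=0.5, rng=None, n_samples=8):
    """Illustrative likelihood surrogate: noise x to timestep t, run the
    denoiser back, and measure the reconstruction error. Samples lying
    in high-density regions of the data distribution reconstruct with
    smaller bias, so the negative error acts as a crude log-likelihood
    proxy for core-set selection."""
    rng = rng or np.random.default_rng(0)
    errs = []
    for _ in range(n_samples):
        eps = rng.standard_normal(x.shape)
        x_t = np.sqrt(1 - t) * x + np.sqrt(t) * eps   # simple assumed noising
        errs.append(np.mean((denoise(x_t, t) - x) ** 2))
    return -float(np.mean(errs))
```

Core-set selection would then keep samples whose scores best match the target coverage of the data distribution.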
- DogFit: Domain-guided Fine-tuning for Efficient Transfer Learning of Diffusion Models
-
This paper proposes DogFit, which internalizes Domain Guidance (DoG) into the fine-tuning loss of diffusion models, enabling the model to learn the guidance direction during training. At inference time, a controllable fidelity–diversity trade-off is achieved without double forward passes, surpassing the state-of-the-art guidance methods on 6 target domains with half the sampling TFLOPS.
- DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation
-
This paper identifies four failure scenarios in multi-object generation (similar shapes, similar textures, dissimilar background biases, and many objects), constructs directional separation vectors to modify three types of CLIP text embeddings (semantic token / EOT / pooled), and achieves a 16–25% improvement in success rate and a 3–12% reduction in mixing rate on SDXL, with inference speed close to the baseline (~4× faster than Attend-and-Excite).
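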
- EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding
-
This paper proposes EchoGen, a unified framework for layout-to-image generation (L2I) and image-to-layout grounding (I2L), trained through a progressive three-stage pipeline — parallel pre-training → dual-task joint optimization → cycle reinforcement learning (CycleRL) — which leverages the layout→image→layout cycle consistency as a self-supervised reward, achieving state-of-the-art results on MS-COCO and LayoutSAM.
- EfficientFlow: Efficient Equivariant Flow Policy Learning for Embodied AI
-
This paper proposes EfficientFlow, which incorporates equivariance into the Flow Matching policy learning framework. It theoretically proves that an isotropic prior combined with an equivariant velocity network guarantees an equivariant action distribution, and introduces Flow Acceleration Upper Bound (FABO) regularization to accelerate sampling. On 12 tasks from MimicGen, EfficientFlow achieves 20–56× faster inference than EquiDiff with superior performance.
- Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective
-
This paper proposes RetSimd, which "replays the whole story" by segmenting text and generating a series of supplementary images via a text-to-image model, combined with a graph neural network to fuse multi-image relationships. The approach significantly enhances the contribution of the image modality to misinformation detection, consistently improving the performance of five SOTA methods across three benchmark datasets.
- Exposing DeepFakes via Hyperspectral Domain Mapping
-
This paper proposes HSI-Detect, a two-stage deepfake detection framework that first reconstructs RGB images into 31-channel hyperspectral images to amplify spectral artifacts introduced by generative models, then performs detection in the hyperspectral domain, achieving a mean AUC of 68.92% on cross-manipulation generalization benchmarks on FaceForensics++, surpassing RGB-only baselines.
- FGM-HD: Boosting Generation Diversity of Fractal Generative Models through Hausdorff Dimension Induction
-
This paper is the first to introduce Hausdorff Dimension (HD) into Fractal Generative Models (FGM), proposing a learnable HD estimation module, a Monotonic Momentum-Driven Scheduling strategy (MMDS), and HD-guided rejection sampling. The method achieves a 39% improvement in generation diversity (Recall) on ImageNet while maintaining image quality.
- Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment
-
This paper proposes R-REPA (Reverse Representation Alignment), which creatively exploits the invertibility of Normalizing Flows to align intermediate features with visual foundation models along the generative (reverse) path. It further introduces a training-free classification algorithm, achieving new state-of-the-art results for normalizing flows on ImageNet 64×64 and 256×256 with a 3.3× training speedup.
- FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting
-
This paper proposes FreeInpaint, a plug-and-play, training-free method that optimizes the initial noise to steer attention toward the inpainting region (PriNo), and during denoising decomposes the conditional distribution into three guidance terms — text alignment, visual rationality, and human preference (DeGu) — simultaneously improving prompt alignment and visual rationality in image inpainting.
- GEWDiff: Geometric Enhanced Wavelet-based Diffusion Model for Hyperspectral Image Super-resolution
-
This paper proposes GEWDiff, a geometric enhanced wavelet-based diffusion model that efficiently compresses hyperspectral data into a latent space via a wavelet encoder-decoder, introduces edge-aware noise scheduling and mask-conditional control to preserve geometric integrity, and designs a multi-level loss function to facilitate stable convergence, achieving state-of-the-art performance on 4× hyperspectral image super-resolution.
- HACK: Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling
-
This paper identifies that attention heads in VAR models naturally fall into two categories — Contextual Heads (semantic consistency, vertical attention patterns) and Structural Heads (spatial coherence, multi-diagonal patterns) — and proposes the HACK framework, which employs asymmetric budget allocation and pattern-specific compression strategies to achieve lossless generation quality at 70% compression, yielding 1.75× memory reduction and 1.57× speedup on Infinity-8B.
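A head classifier in this spirit can be built from a simple diagonality statistic; the band width and threshold below are assumptions, not the paper's exact criterion.

```python
import numpy as np

def diagonal_score(attn):
    """Fraction of attention mass falling in a band around the main
    diagonal of an (n x n) attention map. Heads with multi-diagonal,
    spatially coherent patterns (structural heads) score high; heads
    with vertical/global patterns (contextual heads) score low, which
    can then drive asymmetric KV-cache budget allocation."""
    n = attn.shape[-1]
    i, j = np.indices((n, n))
    band = np.abs(i - j) <= max(1, n // 8)
    return float(attn[band].sum() / attn.sum())
```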
- HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models
-
This paper proposes HierarchicalPrune, which exploits the hierarchical functional differences among blocks in MMDiT-based diffusion models—early blocks establish semantic structure while late blocks refine texture details—and combines three techniques: Hierarchical Position Pruning (HPP), Positional Weight Preservation (PWP), and Sensitivity-Guided Distillation (SGDistill), together with INT4 quantization. Applied to SD3.5 Large Turbo (8B), the method compresses the model from 15.8 GB to 3.24 GB (79.5% memory reduction) with only a 4.8% degradation in image quality.
- How Bias Binds: Measuring Hidden Associations for Bias Control in Text-to-Image Compositions
-
This work is the first to investigate compositional semantic binding bias in text-to-image generation. It proposes the Bias Adherence Score (BA-Score) to quantify how object–attribute binding activates bias, and introduces a training-free Context-Bias Control (CBC) framework that achieves over 10% debiasing improvement in compositional generation via token embedding decoupling and residual injection.
- Hyperbolic Hierarchical Alignment Reasoning Network for Text-3D Retrieval
-
This paper proposes H2ARN, which embeds text and 3D point cloud data in the Lorentz hyperbolic space. It addresses hierarchical representation collapse via a hierarchical ordering loss (entailment cones), and mitigates redundancy-induced saliency dilution via contribution-aware hyperbolic aggregation. The method achieves state-of-the-art performance on Text-3D retrieval and introduces the T3DR-HIT v2 dataset, which is 2.6× larger than its predecessor.
- Improved Masked Image Generation with Knowledge-Augmented Token Representations
-
This paper proposes KA-MIG, a framework that mines three types of token-level semantic prior knowledge graphs from training data (co-occurrence graph, semantic similarity graph, and position-token incompatibility graph), learns augmented token representations via a graph-aware encoder, and injects them into existing MIG models through a lightweight addition-subtraction fusion mechanism, consistently improving generation quality across multiple backbone networks.
- Infinite-Story: A Training-Free Consistent Text-to-Image Generation
-
Built upon a scale-wise autoregressive model (Infinity), this work introduces three training-free techniques—Identity Prompt Replacement (eliminating contextual bias in the text encoder), Adaptive Style Injection (reference image feature injection), and Synchronized Guidance Adaptation (synchronizing both branches of CFG)—to achieve identity- and style-consistent multi-image generation at 6× the speed of diffusion-based methods (1.72 s/image).
- Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers
-
Laytrol achieves high-quality layout-to-image generation on FLUX by initializing the layout control network via parameter copying from MM-DiT, adopting a dedicated initialization scheme (layout encoder initialized as a pure text encoder with zero-initialized outputs), and constructing the LaySyn dataset using FLUX-generated images to mitigate distribution shift.
- LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
-
This paper presents the first systematic study of long-context capabilities in diffusion large language models (diffusion LLMs), revealing stable perplexity under direct extrapolation and a "local awareness" phenomenon. It further proposes LongLLaDA, a training-free method that successfully extends the context window by 6× (to 24k tokens) via NTK-based RoPE extrapolation.
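NTK-based RoPE extrapolation of the kind LongLLaDA builds on is commonly implemented by enlarging the rotary base; the sketch below uses the widely used NTK-aware scaling formula, and whether LongLLaDA uses this exact exponent is an assumption.

```python
import numpy as np

def ntk_rope_freqs(head_dim, scale, base=10000.0):
    """NTK-aware RoPE frequency computation for context extension:
    enlarge the rotary base so the lowest frequency stretches by
    `scale`, which interpolates low frequencies strongly while leaving
    high frequencies (local position information) nearly untouched."""
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return ntk_base ** (-np.arange(0, head_dim, 2) / head_dim)
```

With `scale=6` this mirrors the paper's 6× context extension (e.g. 4k → 24k tokens) without any training.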
- LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations
-
This paper proposes LongT2IBench, the first evaluation benchmark targeting long-text-to-image (T2I) alignment, comprising 14K long-text–image pairs with graph-structured human annotations. It further introduces LongT2IExpert, an evaluator built by fine-tuning an MLLM via Hierarchical Alignment Chain-of-Thought (HA-CoT) instruction tuning, which jointly produces alignment scores and structured explanations.
- MacPrompt: Macaronic-guided Jailbreak against Text-to-Image Models
-
This paper proposes MacPrompt, a black-box cross-lingual attack method that translates harmful words into multi-language candidates and performs character-level recombination to construct "macaronic words" as adversarial prompts. The method simultaneously bypasses text safety filters and concept removal defenses, achieving attack success rates of up to 92% on sexual content and 90% on violent content.
- MACS: Multi-source Audio-to-Image Generation with Contextual Significance and Semantic Alignment
-
This paper proposes MACS, the first two-stage framework that explicitly separates multi-source audio prior to image generation. The framework combines weakly supervised sound source separation, CLAP-space semantic alignment (via ranking loss and contrastive loss), and decoupled cross-attention diffusion generation, achieving comprehensive state-of-the-art performance on multi-source, mixed-source, and single-source audio-to-image generation tasks.
- Mass Concept Erasure in Diffusion Models with Concept Hierarchy
-
This paper proposes a grouped erasure strategy based on supertype-subtype concept hierarchy and Supertype-Preserving LoRA (SuPLoRA). By freezing the down-projection matrix (orthogonal to the supertype subspace) and training only the up-projection matrix, the method achieves an optimal balance between erasure effectiveness and generation quality in large-scale, multi-domain concept erasure.
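The supertype-preserving mechanism can be sketched with a frozen down-projection chosen orthogonal to a protected subspace; the interface and initialization below are assumptions, but the invariance property is exactly the one the freezing is meant to guarantee.

```python
import numpy as np

class SuPLoRAStyleAdapter:
    """Sketch of supertype-preserving LoRA: W' = W + B @ A, where the
    down-projection A is projected off a protected 'supertype' subspace
    and frozen, and only the up-projection B is trained. Updates then
    cannot change the weight's action on protected directions."""
    def __init__(self, w, protected, rank=2, rng=None):
        rng = rng or np.random.default_rng(0)
        a = rng.standard_normal((rank, w.shape[1]))
        q, _ = np.linalg.qr(protected.T)       # orthonormal protected basis
        self.a = a - (a @ q) @ q.T             # frozen, orthogonal to supertype
        self.b = np.zeros((w.shape[0], rank))  # trainable up-projection
        self.w = w
    def weight(self):
        return self.w + self.b @ self.a
```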
- MDiff4STR: Mask Diffusion Model for Scene Text Recognition
-
This work is the first to introduce Mask Diffusion Models (MDM) into Scene Text Recognition (STR), proposing MDiff4STR. It addresses the training-inference noising gap via six training mask strategies and resolves overconfident predictions through a Token Replacement Noise mechanism. With only 3 denoising steps, MDiff4STR surpasses state-of-the-art autoregressive models in accuracy while achieving 3× inference speedup.
- Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models
-
Through systematic probing analysis of attention maps in diffusion models, this work reveals that self-attention maps are critical for preserving the temporal structure of music. Based on this finding, Melodia is proposed — a training-free music editing method that achieves an optimal balance between attribute modification and structural preservation by selectively manipulating self-attention maps.
- Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution
-
This paper introduces the sparse Mixture-of-Experts (MoE) paradigm into real-world image super-resolution, proposing a Mixture-of-Ranks (MoR) architecture that treats each LoRA rank as an independent expert. Combined with a CLIP-driven degradation estimation module and a degradation-aware load balancing loss, the method achieves one-step high-fidelity super-resolution reconstruction.
- MP1: MeanFlow Tames Policy Learning in 1-step for Robotic Manipulation
-
This work introduces the MeanFlow paradigm to the robot learning domain for the first time. By incorporating 3D point cloud inputs and a Dispersive Loss, MP1 generates action trajectories in a single network forward pass (1-NFE), achieving state-of-the-art success rates with an inference latency of only 6.8 ms on robotic manipulation tasks.
- Multi-Aspect Cross-modal Quantization for Generative Recommendation
-
This paper proposes MACRec, which introduces multi-aspect cross-modal interaction at both the semantic ID learning stage and the generative model training stage. Through cross-modal quantization (contrastive learning-enhanced residual quantization) and multi-aspect alignment (implicit + explicit), MACRec significantly improves recommendation performance while reducing ID collision rates.
- Multi-Metric Preference Alignment for Generative Speech Restoration
-
This paper proposes a Multi-Metric Preference Alignment strategy that constructs a preference dataset, GenSR-Pref (80K pairs), requiring unanimous agreement across multiple complementary metrics. DPO is applied to post-training alignment of three generative speech restoration paradigms (AR, MGM, FM), achieving substantial quality improvements while effectively mitigating reward hacking.
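The unanimous-agreement selection rule is simple to state in code; the sketch below is an assumed distillation of it, with metric values as plain score tuples.

```python
def unanimous_preference(a_scores, b_scores):
    """Keep (a, b) as a preference pair only if one candidate beats the
    other on *every* complementary metric; ties or metric disagreements
    are discarded, which is what guards against optimizing (and hacking)
    any single metric during DPO alignment."""
    if all(a > b for a, b in zip(a_scores, b_scores)):
        return "a"
    if all(b > a for a, b in zip(a_scores, b_scores)):
        return "b"
    return None
```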
- ORVIT: Near-Optimal Online Distributionally Robust Reinforcement Learning
-
This paper studies online distributionally robust reinforcement learning and proposes the RVI-\(f\) algorithm based on \(f\)-divergence uncertainty sets. It achieves near minimax-optimal regret bounds under both \(\chi^2\) and KL divergences without relying on any structural assumptions.
- PADiff: Predictive and Adaptive Diffusion Policies for Ad Hoc Teamwork
-
This work is the first to apply diffusion models to the Ad Hoc Teamwork (AHT) problem. The proposed PADiff framework achieves real-time adaptation to dynamic teammates via an Adaptive Feature Modulation Net (AFM-Net), and injects teammate intent prediction into the denoising process through a Predictive Guidance Block (PGB), achieving an average improvement of 35.25% over existing methods in multimodal cooperative scenarios.
- PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement
-
This paper proposes PASE, a framework that leverages robust phonological priors embedded in pretrained WavLM via Denoising Representation Distillation (DRD) to suppress linguistic hallucinations, while employing a dual-stream representation (high-level phonetic + low-level acoustic) to eliminate acoustic hallucinations, simultaneously achieving state-of-the-art performance in both perceptual quality and content fidelity.
- Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback
-
This paper proposes a DiT-based audio-driven human video generation framework built on Wan2.1, featuring a LoRA training strategy for long video generation, partial parameter updates combined with DPO reward feedback to enhance lip synchronization and motion naturalness, and a novel training-free Mask-CFG method that enables multi-character (≥3 persons) audio-driven animation for the first time.
- ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration
-
This paper proposes ProCache, a training-free dynamic feature caching framework that achieves 2.90× speedup on DiT-XL/2 and 1.96× speedup on PixArt-α with negligible image quality degradation, through constraint-aware non-uniform caching pattern search and selective computation, significantly outperforming existing caching methods.
- QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution
-
This paper proposes QuantVSR, the first low-bit (4/6-bit) post-training quantization framework for diffusion-based video super-resolution (VSR). It introduces a Spatiotemporal Complexity-Aware (STCA) mechanism for layer-adaptive rank allocation and a Learnable Bias Alignment (LBA) module to mitigate quantization bias. Under the 4-bit setting, QuantVSR achieves 84.39% parameter compression and 82.56% computation reduction while maintaining performance comparable to the full-precision model.
- ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment
-
This paper proposes ReAlign (Reward-guided sampling Alignment), which employs a step-aware reward model and a reward-guided sampling strategy to dynamically steer sampling trajectories toward distributions with high text-motion alignment during diffusion inference, significantly improving the generation quality of various motion generation methods without fine-tuning any diffusion model. Using MLD as a baseline, R@1 improves by 17.9% and FID improves by 58.8%.
- Realism Control One-step Diffusion for Real-World Image Super-Resolution
-
This paper proposes the RCOD framework, which endows one-step diffusion (OSD) super-resolution methods with the ability to flexibly control the fidelity–realism trade-off at inference time via a latent domain grouping strategy and degradation-aware sampling. A visual prompt injection module is also introduced to replace text prompts, improving restoration accuracy.
- Realistic Face Reconstruction from Facial Embeddings via Diffusion Models
-
This paper proposes the FEM (Face Embedding Mapping) framework, which employs a KAN-based network to map embeddings from arbitrary face recognition (FR) or privacy-preserving face recognition (PPFR) systems into the embedding space of a pretrained identity-preserving (ID-Preserving) diffusion model, enabling high-resolution realistic face reconstruction for evaluating privacy leakage risks in FR systems.
- Rectified Noise: A Generative Model Using Positive-incentive Noise
-
This paper proposes Rectified Noise (ΔRN), which leverages the positive-incentive noise (π-noise) framework to learn a set of beneficial noise signals and inject them into the velocity field of a pretrained Rectified Flow model, achieving a reduction in FID from 10.16 to 9.05 on ImageNet-1k with only 0.39% additional parameters.
- RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers
-
This paper proposes the RelaCtrl framework, which quantifies the sensitivity of each DiT layer to control information via a ControlNet Relevance Score, and uses this analysis to guide the placement and modeling capacity of control blocks. A Two-Dimensional Shuffle Mixer (TDSM) is introduced to replace self-attention and FFN, achieving controllable generation quality superior to PixArt-δ with only 15% of its parameters and computational cost.
- RetrySQL: Text-to-SQL Training with Retry Data for Self-Correcting Query Generation
-
This paper proposes the RetrySQL training paradigm, which injects retry data (erroneous steps + [BACK] token + correct steps) into reasoning chains during continual pre-training of small encoder models. This approach enables a 1.5B open-source model to acquire self-correction capabilities, achieving improvements of up to 4 and 3.93 percentage points in overall execution accuracy on the BIRD and SPIDER benchmarks, respectively, with gains of up to 9 percentage points on challenging samples.
- Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation
-
This paper systematically investigates fundamental deficiencies in compositional fidelity of text-to-image (T2I) models, focusing on three basic primitives—negation, counting, and spatial relations. It reveals a "submultiplicative" interference phenomenon in which models perform adequately on individual primitives but suffer dramatic performance degradation under joint composition, attributing this to training data scarcity, the unsuitability of continuous attention architectures for discrete logic, and evaluation metrics biased toward visual plausibility rather than constraint satisfaction.
- Self-NPO: Data-Free Diffusion Model Enhancement via Truncated Diffusion Fine-Tuning
-
This paper proposes Self-NPO, a negative preference optimization method that requires neither external data annotation nor reward models. By leveraging Truncated Diffusion Fine-Tuning (TDFT), the model learns "what is bad" from its own low-quality generated data, and uses CFG to steer generation away from undesirable outputs. Self-NPO achieves comparable performance to Diffusion-NPO at less than 1% of the training cost.
- SimDiff: Simpler Yet Better Diffusion Model for Time Series Point Forecasting
-
This paper proposes SimDiff — the first purely end-to-end diffusion model to achieve state-of-the-art performance on time series point forecasting. A unified Transformer network serves simultaneously as denoiser and predictor. Combined with Normalization Independence for distribution shift handling and a Median-of-Means ensemble strategy that converts probabilistic samples into precise point predictions, SimDiff achieves 1st place on 6 and 2nd place on 3 out of 9 benchmarks.
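The Median-of-Means step that converts probabilistic samples into a point forecast can be sketched directly; the group count is arbitrary here.

```python
import numpy as np

def median_of_means(samples, n_groups=5):
    """Median-of-Means point estimate over forecast samples: split the
    samples into groups, average within each group, then take the median
    of the group means. Robust to a few outlier trajectories that would
    skew a plain mean."""
    samples = np.asarray(samples)
    groups = np.array_split(samples, n_groups, axis=0)
    return np.median([g.mean(axis=0) for g in groups], axis=0)
```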
- SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation
-
This paper proposes SpecDiff, a training-free multi-level feature caching strategy based on self-speculation. By leveraging a small number of speculative steps to introduce future information for token importance selection, SpecDiff overcomes the accuracy–speed bottleneck of methods that rely solely on historical information, achieving 2.80×/2.74×/3.17× speedup on Stable Diffusion 3/3.5 and FLUX with negligible quality loss.
- Stabilizing Self-Consuming Diffusion Models with Latent Space Filtering
-
This paper proposes Latent Space Filtering (LSF), a method that analyzes the degradation of low-dimensional structure in the latent representations of self-consuming diffusion models and uses confidence scores from a probing classifier to filter low-quality synthetic data. Under a fixed training budget, LSF effectively mitigates model collapse without requiring additional real data or an enlarged training set.
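The filtering step can be sketched as a top-k selection by probing-classifier confidence; the keep-ratio rule and `prob_fn` interface are assumptions standing in for the paper's thresholding.

```python
import numpy as np

def latent_space_filter(latents, labels, prob_fn, keep_ratio=0.8):
    """Score each synthetic sample by a probing classifier's confidence
    in its conditioning label and keep only the top keep_ratio fraction,
    discarding the low-confidence samples whose degraded latent
    structure would otherwise feed model collapse."""
    conf = np.array([prob_fn(z)[y] for z, y in zip(latents, labels)])
    k = int(len(latents) * keep_ratio)
    keep = np.argsort(-conf)[:k]          # indices of most confident samples
    return [latents[i] for i in keep]
```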
- Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression
-
This paper proposes SODEC, a one-step diffusion-based image compression model that injects the prior of a high-fidelity VAE decoder into the diffusion generation process via a Fidelity Guidance Module (FGM). Combined with a rate annealing training strategy, SODEC achieves high-quality compression at extremely low bitrates, with decoding speed more than 20× faster than multi-step diffusion methods, while reaching state-of-the-art rate-distortion-perception trade-offs.
- Structure-based RNA Design by Step-wise Optimization of Latent Diffusion Model
-
This paper proposes SOLD, a framework that integrates a latent diffusion model (LDM) with reinforcement learning (RL) via a step-wise single-step sampling optimization strategy to directly optimize non-differentiable structural metrics in RNA inverse folding — including secondary structure similarity (SS), minimum free energy (MFE), and LDDT — achieving comprehensive improvements over existing methods across multiple metrics.
- Studying Classifier(-Free) Guidance From A Classifier-Centric Perspective
-
Through systematic empirical study, this paper reveals the essential mechanism underlying both classifier guidance and classifier-free guidance — both steer denoising trajectories away from the classifier's decision boundary to achieve conditional generation — and proposes a flow matching-based post-processing method that validates this "classifier-centric" perspective on high-dimensional data.
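For reference, the standard classifier-free guidance combination that the paper analyzes: the difference term acts like an implicit classifier gradient, which is what pushes the denoising trajectory away from the decision boundary.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, w=7.5):
    """Standard CFG: extrapolate from the unconditional prediction along
    the conditional-minus-unconditional direction. The difference term
    plays the role of an implicit classifier gradient."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```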
- T-LoRA: Single Image Diffusion Model Customization Without Overfitting
-
This paper proposes T-LoRA, a timestep-dependent low-rank adaptation framework that addresses overfitting in single-image diffusion model customization. The framework dynamically adjusts the effective LoRA rank across diffusion timesteps (smaller rank at high-noise timesteps, larger rank at low-noise timesteps) and employs orthogonal initialization (Ortho-LoRA) via random matrix SVD to ensure information independence among adaptation components, achieving an optimal balance between concept fidelity and text alignment.
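The two ingredients can be sketched independently: a rank schedule that decreases with noise level, and an orthogonal initialization via SVD of a random matrix. The linear schedule and bounds below are my assumptions, not the paper's exact choices:

```python
import numpy as np

def effective_rank(t, t_max, r_min=2, r_max=16):
    """Timestep-dependent rank: high-noise timesteps (large t) get a
    smaller effective LoRA rank, low-noise timesteps a larger one."""
    frac = 1.0 - t / t_max
    return int(round(r_min + frac * (r_max - r_min)))

def ortho_lora_init(d_in, d_out, r, seed=0):
    """Ortho-LoRA-style init: SVD of a random matrix yields orthonormal
    factors, so the r adaptation components carry independent directions."""
    rng = np.random.default_rng(seed)
    u, _, vt = np.linalg.svd(rng.standard_normal((d_out, d_in)),
                             full_matrices=False)
    return u[:, :r], vt[:r, :]  # down-/up-projection factors
```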
- T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model
-
This paper constructs T2I-RiskyPrompt — a comprehensive benchmark comprising 6,432 valid risky prompts spanning 6 major categories and 14 subcategories, each annotated with hierarchical labels and detailed risk rationales. A reason-driven MLLM-based risk detection method is proposed (achieving 91.8% accuracy with a 3B model), and a systematic evaluation is conducted across 8 T2I models, 9 defense methods, 5 safety filters, and 5 attack strategies.
- Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances
-
This paper proposes VALOR, a validation-aware multimodal expert framework combining a multi-expert routing architecture with Chain-of-Thought reasoning and a semantic alignment validation mechanism, which achieves joint fine-grained classification of complaint Aspect and Severity in multi-turn multimodal customer service dialogues, yielding absolute improvements of 12.94%/6.51% over the strongest baseline Gemma-3.
- Targeted Data Protection for Diffusion Model by Matching Training Trajectory
-
TAFAP is the first method to achieve effective Targeted Data Protection (TDP) for diffusion models: it generates adversarial perturbations via training-trajectory matching, redirecting the outputs of unauthorized fine-tuning toward a user-specified target concept while maintaining high image quality.
- TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs
-
This paper proposes TruthfulRAG, a framework that, for the first time, leverages knowledge graphs (KGs) to resolve conflicts between retrieved knowledge and LLM parametric knowledge at the factual level in RAG systems. The framework improves generation accuracy and trustworthiness through triple extraction, query-aware graph retrieval, and an entropy-based conflict filtering mechanism.
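One way to picture entropy-based conflict detection (my own simplified reading, not the paper's exact criterion): if injecting a retrieved triple sharply changes the entropy of the model's answer distribution, the triple likely conflicts with parametric knowledge and is flagged for resolution.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of an answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def conflict_score(p_with_fact, p_without_fact):
    """Entropy shift caused by conditioning on a retrieved triple; a
    large shift signals tension with the model's parametric knowledge."""
    return abs(entropy(p_with_fact) - entropy(p_without_fact))
```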
- TSGDiff: Rethinking Synthetic Time Series Generation from a Pure Graph Perspective
-
This paper proposes TSGDiff, the first framework to rethink time series generation from a purely graph-based perspective. Time series are represented as dynamic graphs constructed from Fourier spectral features, diffusion modeling is performed in the graph latent space, and a novel Topo-FID metric is introduced to evaluate the structural fidelity of generated time series.
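The series-to-graph construction can be sketched as: each window becomes a node whose feature is its Fourier magnitude spectrum, with edges to its most spectrally similar windows. Window size and top-k are assumptions for illustration:

```python
import numpy as np

def series_to_graph(series, window=8, top_k=2):
    """Represent a time series as a (nodes, adjacency) pair: node
    features are per-window Fourier magnitude spectra; each node links
    to its top_k most spectrally similar windows (cosine similarity)."""
    n = len(series) // window
    nodes = np.stack([np.abs(np.fft.rfft(series[i * window:(i + 1) * window]))
                      for i in range(n)])
    normed = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # no self-loops
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        adj[i, np.argsort(sim[i])[-top_k:]] = 1
    return nodes, adj
```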
- UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective
-
This paper proposes UNSEEN, a dataset pruning method that improves coreset selection from a generalization perspective—considering not only how retained samples contribute to training loss, but also how they contribute to test-time generalization. UNSEEN selects coresets that better align the training distribution with unseen test distributions.
- VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning
-
VoiceCloak is a proactive defense framework against diffusion-based voice cloning that simultaneously achieves speaker identity obfuscation and perceptual quality degradation via four-dimensional adversarial perturbations, attaining a defense success rate (DSR) of 71.4% on LibriTTS and substantially outperforming all existing defense methods.
- X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning
-
A high-quality 3.7M-sample editing dataset covering 14 task categories is constructed, and a lightweight (0.9B-parameter) plug-and-play editing module based on Task-Aware MoE-LoRA and contrastive learning is proposed, achieving performance comparable to fully fine-tuned 12B models.
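A task-aware MoE-LoRA forward pass can be sketched as: a router maps the task embedding to softmax weights over LoRA experts, and the applied update is the weighted sum of each expert's low-rank delta. All names and shapes below are illustrative, not X2Edit's actual module:

```python
import numpy as np

def moe_lora_delta(x, task_emb, experts, router_w):
    """Task-aware MoE-LoRA sketch.
    x:        input activation, shape (d,)
    task_emb: task embedding, shape (e,)
    experts:  list of (A, B) low-rank pairs, A: (r, d), B: (d, r)
    router_w: routing matrix, shape (n_experts, e)
    """
    logits = router_w @ task_emb
    w = np.exp(logits - logits.max())
    w /= w.sum()  # softmax over experts
    # Weighted sum of each expert's low-rank update B @ A @ x.
    return sum(wi * (B @ (A @ x)) for wi, (A, B) in zip(w, experts))
```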