🎨 Image Generation

💬 ACL2025 · 9 paper notes


🔥 Top topics: Speech & Audio ×3 · Text-to-Image ×2 · Few-/Zero-Shot Learning ×2

A Unified Agentic Framework for Evaluating Conditional Image Generation: CIGEval is a unified evaluation framework built on Large Multimodal Model (LMM) agents. By integrating several tools (Grounding, Highlight, Difference, Scene Graph) and adopting a divide-and-conquer evaluation strategy, it reaches a correlation with human judgments comparable to inter-annotator agreement (0.4625 vs. human-to-human 0.47) across 7 conditional image generation tasks. A 7B model fine-tuned on only 2.3K training samples surpasses the SOTA GPT-4o baseline.
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Models: This paper proposes D-GEN, the first open-source distractor generation model (fine-tuned LLaMA, 8B/70B), which automatically converts open-ended evaluation questions into multiple-choice format. It is paired with two evaluation methods (ranking alignment and entropy analysis) that verify distractor quality, and it preserves model ranking consistency with Spearman's ρ=0.99 on MMLU.
Planning with Diffusion Models for Target-Oriented Dialogue Systems: DiffTOD models dialogue planning as a trajectory generation problem, leveraging a masked diffusion language model to achieve non-sequential dialogue planning. It designs three guidance mechanisms (word-level/semantic-level/search-based) to flexibly control the dialogue toward the target, significantly outperforming baselines in negotiation, recommendation, and chitchat scenarios.
FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation: This paper introduces Rectified Flow to text-to-audio generation. By leveraging bifocal samplers to optimize timestep distribution, immiscible flow to minimize total data-noise distance, and anchored optimization to correct CFG guidance errors, the proposed method achieves single-step generation with a FAD of 1.49, outperforming 100-step diffusion models while reaching a generation speed of 400x real-time.
Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models: Math2Visual is a framework for automatically generating pedagogical visualizations from textual descriptions of math word problems (MWPs). The paper defines a visual language and design space based on teacher interviews, constructs a labeled dataset of 1,903 images, and evaluates and fine-tunes multiple TTI models, revealing key deficiencies of current models in representing mathematical relationships.
Multimodal Pragmatic Jailbreak on Text-to-image Models: This paper proposes a new type of attack called "Multimodal Pragmatic Jailbreak" (MPJ). T2I models are prompted to generate images containing visual text, such that the image content and the text content are each safe when evaluated individually but yield unsafe content when combined. The study reveals that all tested models, including DALL·E 3, are vulnerable to this attack.
OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching: This paper proposes OZSpeech, the first zero-shot TTS system that combines Optimal Transport Conditional Flow Matching (OT-CFM) with a learned prior distribution to achieve one-step sampling. It significantly outperforms existing approaches in content accuracy (WER), inference speed, and model size.
R-VC: Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching: R-VC is the first zero-shot voice conversion system to achieve rhythm control. It models the target speaker's rhythm style with a Mask Transformer duration model and pairs it with a Shortcut Flow Matching DiT decoder to generate high-quality speech efficiently in only 2 sampling steps, reaching a WER of 3.51 and a speaker similarity of 0.930 on LibriSpeech.
Synthia: Novel Concept Design with Affordance Composition: Synthia proposes a novel concept design framework based on affordance composition. By leveraging a hierarchical concept ontology, an affordance sampling strategy, and curriculum learning to fine-tune a T2I model, it generates innovative designs that are both visually novel and functionally coherent.