

💬 ACL2026 · 11 paper notes

AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

This paper proposes the AFMRL framework, which formulates fine-grained product understanding in e-commerce as an attribute generation task. An MLLM generates key attributes to enhance contrastive learning (AGCL), while retrieval performance serves as a reward signal that in turn optimizes the attribute generator (RAR), achieving state-of-the-art retrieval performance on large-scale e-commerce datasets.
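The AGCL component builds on contrastive learning over image-text pairs, where the text side is enriched with the MLLM-generated attributes. The paper's exact loss is not reproduced here; a minimal sketch of the standard symmetric InfoNCE objective such a setup would typically use (function name `info_nce` and the numpy formulation are illustrative assumptions) looks like:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (B, D) arrays; row i of each is a matched pair.
    In an attribute-enhanced setup, txt_emb would encode the product
    title concatenated with its generated key attributes.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) cosine similarities
    labels = np.arange(len(logits))           # diagonal = positive pairs

    def xent(l):
        # cross-entropy with the matched pair as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-logp[labels, labels].mean())

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Well-aligned pairs (high diagonal similarity) drive the loss toward zero, which is what lets retrieval quality double as a training signal.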

BookAgent: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration

BookAgent is a safety-aware multi-agent framework that generates high-quality, character-consistent, and content-safe picture books end-to-end from user drafts through a three-stage closed-loop architecture: Value-Aligned Storyboard (VAS), Iterative Cross-Modal Refinement (ICR), and Temporal Cognitive Calibration (TCC).

CoDial: Interpretable Task-Oriented Dialogue Systems Through Dialogue Flow Alignment

This paper proposes CoDial, a framework that converts predefined dialogue flows (task schemas) into structured heterogeneous graphs and automatically generates LLM guardrail code (e.g., Colang), achieving interpretable and controllable task-oriented dialogue policies at inference time. It reaches SOTA on the STAR benchmark without requiring training data.

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

This paper proposes ControlAudio, a unified progressive diffusion modeling framework that achieves three capabilities—text-guided generation, precise temporal control, and intelligible speech synthesis—within a single diffusion model through three-stage progressive training (TTA pretraining → temporal control fine-tuning → joint temporal+intelligible speech training) and progressive guidance sampling, significantly outperforming existing methods in temporal precision and speech intelligibility.

Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models

This paper systematically investigates how information is distributed across tokens in the text-encoder outputs of text-to-image models, using a causal intervention framework. It finds that the semantics of a lexical item are typically concentrated in one or two representative tokens, and that cross-item information flow causes semantic leakage and incorrectly rendered images in 11% of cases. The paper proposes simple yet effective token-level interventions that improve alignment.
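The causal-intervention idea generalizes beyond any one model: ablate a single token's encoder output and measure how much a downstream readout shifts. This toy sketch (the `causal_effect` helper, the zero baseline, and mean-pooling readout are all illustrative assumptions, not the paper's protocol) shows the pattern:

```python
import numpy as np

def causal_effect(encoder_out, downstream, idx, baseline=None):
    """Estimate how much information token idx carries for a readout.

    encoder_out: (T, D) token-level text-encoder outputs.
    downstream: function mapping (T, D) -> vector readout
                (stands in for the diffusion model's conditioning path).
    Replaces one token's vector with a baseline (zeros by default) and
    returns the norm of the resulting change in the readout.
    """
    patched = encoder_out.copy()
    patched[idx] = np.zeros(encoder_out.shape[1]) if baseline is None else baseline
    return float(np.linalg.norm(downstream(encoder_out) - downstream(patched)))
```

A token whose ablation barely moves the readout carries little of the prompt's semantics; a large shift marks one of the "representative tokens" the paper identifies.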

From Past To Path: Masked History Learning for Next-Item Prediction in Generative Recommendation

This paper proposes Masked History Learning (MHL), a training framework that introduces masked history reconstruction as an auxiliary task alongside autoregressive training in generative recommendation. By combining an entropy-guided adaptive masking strategy with a curriculum learning scheduler, the model shifts from merely predicting "what's next" to understanding "why this path formed," significantly outperforming SOTA on three datasets.
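Entropy-guided masking generally means choosing which history positions to hide based on the model's own predictive uncertainty. A minimal sketch (assuming, as one plausible reading, that the hardest positions, those with highest entropy, are masked; the function name and top-k selection are illustrative):

```python
import numpy as np

def entropy_guided_mask(probs, mask_ratio=0.3):
    """Select history positions to mask for reconstruction.

    probs: (T, V) per-position predictive distributions over the item
           vocabulary, from the current model.
    Returns a boolean mask of length T (True = masked), preferring
    positions where the model is least certain (highest entropy).
    """
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)  # per position
    k = max(1, int(round(mask_ratio * len(entropy))))
    mask = np.zeros(len(entropy), dtype=bool)
    mask[np.argsort(entropy)[-k:]] = True                 # top-k entropy
    return mask
```

A curriculum scheduler would then anneal `mask_ratio` (or the hardness of selected positions) over training, easing the model into the reconstruction task.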

Investigating Counterfactual Unfairness in LLMs towards Identities through Humor

This paper systematically investigates counterfactual unfairness in LLMs through humor scenarios, observing behavioral changes after swapping speaker/listener identities. Results reveal that jokes told by privileged-group speakers are refused at a rate as high as 67.5%, are judged as malicious with 64.7% higher probability, and receive social harm scores up to 1.5 points higher (on a 5-point scale), demonstrating that models have internalized fixed social privilege hierarchies rather than performing genuine social reasoning.

Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

This paper presents the first large-scale systematic audit of the native sampling capability of 11 frontier LLMs across 15 probability distributions, demonstrating that LLMs severely lack intrinsic probabilistic sampling mechanisms and that this deficiency propagates into downstream applications as systematic bias.
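The simplest version of such an audit is a goodness-of-fit test: ask the model for many draws from a claimed distribution and compare the empirical counts against the expected ones. A sketch for the discrete-uniform ("dice") case, using a plain chi-square statistic (the helper name and test choice are assumptions; the paper's audit covers 15 distributions):

```python
from collections import Counter

def chi_square_uniform(samples, k):
    """Chi-square goodness-of-fit statistic for claimed-uniform draws
    over the integers 1..k. Large values indicate the sampler deviates
    from uniformity (compare against a chi-square(k-1) threshold)."""
    n = len(samples)
    expected = n / k
    counts = Counter(samples)
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, k + 1)
    )
```

Feeding in LLM-generated "dice rolls" and comparing the statistic against a true PRNG baseline makes the bias the paper reports directly measurable.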

MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization

This paper proposes MASH (Multi-stage Style Humanization), a three-stage pipeline consisting of style-injection SFT → DPO alignment → inference-time refinement, which trains a rewriter with only 0.1B parameters to evade AI-generated text detectors in a black-box setting with an average attack success rate of 92%, while maintaining high linguistic quality.
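The middle stage of the pipeline uses DPO, whose per-pair objective is standard even though MASH's data construction is not shown here. A minimal sketch of the DPO loss for one (chosen, rejected) rewrite pair (function name and log-probability inputs are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the chosen ("winner", e.g. a
    human-like rewrite) and rejected ("loser") outputs.
    ref_logp_w / ref_logp_l: the same under the frozen reference model.
    beta scales how strongly the policy may deviate from the reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy already prefers the winner
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy raises the chosen rewrite's likelihood relative to the reference, which is what pushes the 0.1B rewriter toward human-like style.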

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

This paper proposes Visualize-then-Retrieve (VisRet), a novel retrieval paradigm that first visualizes a text query into images via a T2I generative model and then performs retrieval within the image modality. VisRet achieves an average nDCG@30 improvement of 0.125 (CLIP) and 0.121 (E5-V) across four benchmarks, and improves downstream VQA accuracy by 15.7% on Visual-RAG-ME.
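For readers unfamiliar with the reported metric, nDCG@k rewards rankings that place highly relevant items early, normalized by the best possible ranking. A minimal reference implementation (the function name is mine; this is the standard definition, not VisRet-specific code):

```python
import math

def ndcg_at_k(ranked_rels, all_rels, k=30):
    """nDCG@k: DCG of the returned ranking over the DCG of the ideal one.

    ranked_rels: graded relevance of retrieved items, in rank order.
    all_rels: every relevance grade for the query (any order), used to
              build the ideal ranking.
    """
    def dcg(rels):
        # positions are 1-indexed, hence log2(i + 2) for 0-based enumerate
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```

So the reported +0.125 is an absolute gain on a metric bounded in [0, 1], which is substantial for retrieval benchmarks.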

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

This paper proposes ZipVoice-Dialog, the first non-autoregressive zero-shot spoken dialogue generation model based on flow matching. Through two simple designs—curriculum learning and speaker-turn embeddings—it addresses the unintelligible speech and turn confusion problems that arise when flow matching is directly applied to dialogue scenarios. The paper also releases OpenDialog (6.8k hours), the first large-scale open-source spoken dialogue dataset.