🗣️ Dialogue Systems

🔬 ICLR 2026 · 5 paper notes

AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions: This paper proposes AQuA, the first visual question answering dataset with fine-grained ambiguity grading across four levels (7.2K samples, 1.8K per level), defining an optimal response strategy for each level (direct answer / inference / enumeration / clarification request). The study finds that GPT-5 and Gemini over-confidently default to direct answers on ambiguous VQA instances, while a 3B model trained via SFT+GRPO surpasses closed-source large models in strategy adaptation.
Non-Collaborative User Simulators for Tool Agents: Drawing on four categories of non-collaborative user behavior from marketing research (requests for unavailable services, tangential chit-chat, impatience, and incomplete utterances), this work constructs a goal-aligned simulation framework and systematically exposes the behavior-specific failure mechanisms of state-of-the-art tool agents on MultiWOZ and τ-bench. Tangential chit-chat causes an average success rate (SR) drop of 29.1%, and distinct model families exhibit qualitatively different failure modes: GPT-series models fall into repetitive helper API calls, while Qwen-series models tend to hallucinate API results.
ReIn: Conversational Error Recovery with Reasoning Inception: This paper proposes Reasoning Inception (ReIn), a test-time intervention method that requires no modification to model parameters or system prompts. An external inception module detects conversational errors and injects recovery plans into the task agent's reasoning chain, significantly improving task completion rates across diverse error scenarios while generalizing to unseen error types.
Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation: This paper proposes FlyThinker, an efficient "think-while-generating" framework in which a dedicated reasoning model (Reasoner) generates latent reasoning signals in parallel at the token level. These signals are dynamically incorporated into a generation model (Generator) to guide personalized long-form generation while preserving both training and inference efficiency.
Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding: By contrasting layer-wise hidden representations (chain-of-embedding) with and without visual input, this paper identifies a "Visual Integration Point" (VIP) layer in LVLMs and proposes the Total Visual Integration (TVI) metric to quantify the strength of language priors.
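
The layer-wise contrast described in the last note can be illustrated with a toy sketch. This is not the paper's definition of VIP or TVI: the cosine distance, the fixed threshold, the averaging step, and all function names (`layerwise_gap`, `first_layer_above`, `mean_gap_from`) are assumptions chosen only to show the general idea of comparing per-layer representations with and without visual input.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


def layerwise_gap(with_visual, without_visual):
    """Per-layer distance between the two chains of hidden representations."""
    if len(with_visual) != len(without_visual):
        raise ValueError("both chains must have the same number of layers")
    return [cosine_distance(a, b) for a, b in zip(with_visual, without_visual)]


def first_layer_above(gaps, threshold):
    """Index of the first layer whose gap exceeds the threshold (assumed rule)."""
    for index, gap in enumerate(gaps):
        if gap > threshold:
            return index
    return None


def mean_gap_from(gaps, start):
    """Average gap from `start` onward (an assumed aggregate, not the paper's TVI)."""
    tail = gaps[start:]
    return sum(tail) / len(tail) if tail else 0.0


# Toy data: four layers of 2-dimensional hidden states (purely illustrative).
with_visual = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
without_visual = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

gaps = layerwise_gap(with_visual, without_visual)
split = first_layer_above(gaps, threshold=0.5)
print(gaps, split, mean_gap_from(gaps, split))
```

In this toy setup the two chains are identical for the first two layers and diverge afterwards, so the assumed threshold rule picks layer index 2. In a real LVLM the hidden states would come from two forward passes over the same prompt, one with the image and one without.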