🗣️ Dialogue Systems
🔬 ICLR2026 · 10 paper notes
🔥 Top topics: Dialogue ×2 · Reasoning ×2
- AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
The authors propose AQuA, the first VQA dataset (7.2K samples) with fine-grained ambiguity levels (4 levels), defining optimal response strategies for each level (Direct Answer/Inference/Enumeration/Request Clarification). The study finds that GPT-5 and Gemini are overconfident, consistently providing direct answers to ambiguous questions. Conversely, a 3B model trained via SFT+GRPO can surpass the strategy adaptation capabilities of closed-source large models.
- ClarifyVC: Clarifying Ambiguous Commands in Vehicle Control with a Hybrid Data Augmentation Pipeline
ClarifyVC employs an agent-orchestrated four-stage data augmentation pipeline to "grow" a large volume of ambiguity-rich and protocol-compliant single/multi-turn dialogues from 20,000 real in-vehicle commands. Accompanied by a three-tier evaluation protocol and a Data Quality Score (DQS), fine-tuning on this data improves parsing accuracy by ~15%, ambiguity resolution by ~20%, and achieves 98% protocol compliance for in-vehicle voice commands.
- Codified Finite-state Machines for Role-playing
To address the problem that LLMs in role-playing mimic surface-level actions but fail to track a character's "internal state," this paper proposes automatically compiling character profiles into executable Codified Finite-State Machines (CFSM), which use code to record character states and transition rules explicitly. A probabilistic extension, CPFSM, models states as probability distributions. On both synthetic validation and real-plot benchmarks drawn from Fandom, the approach shows better coherence and interpretability than prompt-only state-modeling baselines.
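To make the idea of an explicit, code-level character state concrete, here is a minimal sketch of a hand-written finite-state machine. It is not the paper's compiler or its generated code: the class `CharacterFSM`, the helper `build_knight_fsm`, and the states and events are all hypothetical, and the paper derives such rules automatically from character profiles.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterFSM:
    """Hypothetical explicit state tracker for one role-play character."""

    state: str
    # Maps (current_state, event) -> next_state.
    transitions: dict[tuple[str, str], str] = field(default_factory=dict)
    # Human-readable trace of every step, useful for interpretability.
    history: list[str] = field(default_factory=list)

    def step(self, event: str) -> str:
        # Events without a matching rule leave the state unchanged.
        next_state = self.transitions.get((self.state, event), self.state)
        self.history.append(f"{self.state} --{event}--> {next_state}")
        self.state = next_state
        return next_state


def build_knight_fsm() -> CharacterFSM:
    # Illustrative rules only; not taken from the paper.
    return CharacterFSM(
        state="loyal",
        transitions={
            ("loyal", "betrayed_by_king"): "disillusioned",
            ("disillusioned", "offered_redemption"): "loyal",
            ("disillusioned", "joins_rebels"): "rebel",
        },
    )
```

In a role-play loop, the current `state` would be injected into the prompt at each turn, so the model conditions on a tracked state instead of having to re-infer it from the dialogue history.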
- DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning
DRIFT treats the abundant but implicit "user dissatisfaction" (DSAT) from real-world deployments as high-quality negative anchors. Positive samples are dynamically sampled from the current policy, and iterative training is performed using standard DPO. Without requiring human annotations, reward models, or positives generated by stronger models, it enables a 14B model to outperform GPT-4o-mini on WildBench.
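The pairing described above can be sketched as follows, assuming the standard DPO objective on sequence-level log-probabilities. The function names, the record fields `prompt` and `response`, and the `sample_from_policy` callback are hypothetical; this is an illustration of the idea, not the authors' implementation.

```python
import math


def build_drift_pairs(dsat_records, sample_from_policy):
    """Pair each dissatisfying response (rejected) with a fresh policy sample (chosen)."""
    return [
        {
            "prompt": record["prompt"],
            "chosen": sample_from_policy(record["prompt"]),
            "rejected": record["response"],
        }
        for record in dsat_records
    ]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Because the chosen responses are resampled from the current policy, `build_drift_pairs` would be called again at the start of every training iteration rather than once up front.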
- Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings
This paper proposes an extremely fast robustness test: on LLM leaderboards based on the Bradley–Terry model (such as Chatbot Arena), removing a tiny worst-case subset (as few as 2 preferences or 0.003%) of human evaluations can change the top-ranked model. The method precisely identifies which specific preferences cause the flip.
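As a toy illustration of the fragility being tested, the sketch below fits Bradley–Terry scores with the classic minorization-maximization update and then brute-forces every subset of up to two preferences to see whether dropping it changes the top model. This exhaustive search is a stand-in for intuition only; it is not the paper's fast method, and the function names and the 14-preference dataset are hypothetical.

```python
from collections import Counter
from itertools import combinations


def fit_bradley_terry(prefs, iters=500):
    """Fit Bradley-Terry scores from (winner, loser) pairs via MM updates."""
    models = sorted({m for pair in prefs for m in pair})
    wins = Counter(winner for winner, _ in prefs)
    games = Counter(frozenset(pair) for pair in prefs)
    score = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (score[i] + score[j])
                for j in models
                if j != i and games[frozenset((i, j))] > 0
            )
            updated[i] = wins[i] / denom if denom > 0 else score[i]
        total = sum(updated.values())
        score = {m: v / total for m, v in updated.items()}
    return score


def smallest_flip(prefs, max_drop=2, margin=1e-6):
    """Return (old_top, new_top, dropped_prefs) for the first flip found, else None."""
    base = fit_bradley_terry(prefs)
    top = max(base, key=base.get)
    for k in range(1, max_drop + 1):
        for dropped in combinations(range(len(prefs)), k):
            kept = [p for idx, p in enumerate(prefs) if idx not in dropped]
            score = fit_bradley_terry(kept)
            new_top = max(score, key=score.get)
            # Require a strict lead so exact ties are not counted as flips.
            if new_top != top and score[new_top] > score[top] * (1 + margin):
                return top, new_top, [prefs[idx] for idx in dropped]
    return None


# Hypothetical toy data: model A narrowly leads model B.
toy_prefs = (
    [("A", "B")] * 4 + [("B", "A")] * 2
    + [("A", "C")] * 2 + [("C", "A")] * 2
    + [("B", "C")] * 3 + [("C", "B")] * 1
)
flip = smallest_flip(toy_prefs)
```

Brute force scales combinatorially with the number of preferences, which is why a fast approximate test matters for leaderboards with tens of thousands of votes.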
- Flipping the Dialogue: Training and Evaluating User Language Models
This paper "flips" the dialogue: instead of training LLMs to be better assistants, it post-trains a User Language Model (User LM) to simulate real human users. The User LM is then used to expose weaknesses of assistant LMs in realistic multi-turn scenarios; for example, GPT-4o's task success rate drops from 74.6% to 57.4%.
- Non-Collaborative User Simulators for Tool Agents
Drawing on marketing research, this paper defines four types of non-collaborative user behavior (unavailable service, tangential chat, impatience, and incomplete utterances) and builds a simulation framework that keeps the simulated user aligned with its goal. Evaluations on MultiWOZ and \(\tau\)-bench systematically expose behavior-specific failure mechanisms in SOTA tool agents. Tangential chat causes an average success rate (SR) drop of 29.1%, and different models collapse in different ways: the GPT series falls into repetitive helper API calls, while the Qwen series tends to hallucinate API results.
- ReIn: Conversational Error Recovery with Reasoning Inception
This paper proposes Reasoning Inception (ReIn), a test-time intervention method that requires no changes to model parameters or system prompts. An external inception module detects conversational errors and injects recovery plans into the task agent's reasoning chain, which significantly improves task completion rates across various error scenarios and generalizes to unseen error types.
- Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation
This paper proposes FlyThinker, an efficient "think-while-generating" framework in which an independent Reasoner generates token-level latent reasoning signals in parallel with generation. These signals are dynamically integrated into the Generator to guide personalized long-form generation while preserving training and inference efficiency.
- Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
By comparing layer-wise hidden representations (chain-of-embedding) with and without visual input, this study identifies a "Visual Integration Point" (VIP) layer in LVLMs and proposes the Total Visual Integration (TVI) metric to quantify the strength of the language prior.