Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions¶
Conference: ICLR2026
arXiv: 2511.03047
Code: Not released (paper indicates release with final version)
Area: LLM/NLP
Keywords: Unsupervised Evaluation, Multi-Turn Dialogue, Goal Completion, LLM Uncertainty, Response Tree, LLM-Guided Clustering
Authors: Emi Soroka, Tanmay Chopra, Krish Desai, Sanjay Lall (Stanford & Emissary Technologies)
TL;DR¶
Three unsupervised metrics are proposed—LLM-guided clustering (goal identification), interaction completeness detection via fine-tuned completion models, and response trees (LLM uncertainty quantification)—for evaluating multi-turn objective-driven dialogues without labeled data or LLM-as-a-judge, achieving performance that matches or exceeds a 70B judge using only an 8B model.
Background & Motivation¶
Evaluation of enterprise LLM systems is challenging: Task-oriented dialogue, AI agents, and customer service systems involving objective-driven interactions are increasingly prevalent, yet evaluation methods lag significantly behind—data are complex and unannotated, and manual labeling does not scale.
LLM-as-a-judge is unreliable: Well-documented issues include position bias, verbosity bias, familiarity bias, output inconsistency, and sensitivity to prompt phrasing.
Distribution shift: Objective-driven systems introduce reasoning, tool invocation, multi-agent interaction, and shared environment manipulation, all of which diverge from the base conversational distribution on which LLMs are pretrained, further complicating evaluation.
Limitations of existing metrics: ROUGE/BLEU require reference answers; perplexity is informationally limited; custom metrics can only monitor known error types.
Core Problem: Design evaluation metrics that require zero annotations and zero reference answers, and that can automatically discover user goals, detect interaction completeness, and quantify LLM uncertainty.
Method¶
Metric 1: LLM-Guided Clustering (User Goal Identification)¶
Objective: Automatically discover and label user goal categories from unannotated multi-turn dialogues.
Three-stage algorithm (Algorithm 1):
Preprocessing: For each dialogue \(c_i\), an LLM is prompted to generate a free-text goal summary \(s_i\), which is then embedded as \(v_i \in \mathbb{R}^{1536}\) using text-embedding-3-small.
Phase 1 — Initial Clustering + Labeling:
- K-means is applied to \(v_1, \dots, v_n\) to obtain \(k_1\) initial clusters (with \(k_1\) set as a generous overestimate)
- For each cluster, 10 positive and 10 negative samples are drawn, and an LLM is prompted to generate a cluster description \(L_i\)
- All descriptions are embedded to obtain \(d_1, \dots, d_{k_1}\)
Phase 2 — Iterative Merging:
- A pairwise cosine similarity matrix is computed: \(D_{ij} = \frac{d_i^\top d_j}{\|d_i\|_2 \|d_j\|_2}\)
- The pair with the maximum \(D_{ij}\) is iteratively selected, and the LLM is prompted to decide whether to merge (again with 10 positive and 10 negative samples)
- Merged clusters receive regenerated descriptions; termination occurs when all current cluster pairs are rejected for merging
Advantages: Combines the stability of k-means with the semantic understanding of LLMs, producing interpretable clusters with natural language labels.
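A minimal sketch of this pipeline, assuming the OpenAI Python SDK for both the goal-summarization/labeling LLM and `text-embedding-3-small`, plus scikit-learn k-means; the labeling model name, prompt wording, and the simplified merge loop are assumptions rather than the paper's exact Algorithm 1:

```python
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])  # shape (n, 1536)

def llm(prompt):
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # assumed labeling model, not specified here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def cluster_goals(dialogues, k1=30):
    # Preprocessing: summarize each dialogue's goal, then embed the summaries.
    summaries = [llm(f"Summarize the user's goal in one sentence:\n{d}") for d in dialogues]
    vectors = embed(summaries)

    # Phase 1: over-clustered k-means, then one LLM-generated description per cluster.
    assignments = KMeans(n_clusters=k1, n_init=10).fit_predict(vectors)
    clusters = {i: [s for s, a in zip(summaries, assignments) if a == i] for i in range(k1)}
    descriptions = {i: llm("Describe the common goal of these dialogues:\n" + "\n".join(m[:10]))
                    for i, m in clusters.items() if m}

    # Phase 2: repeatedly merge the most similar pair of cluster descriptions.
    while len(descriptions) > 1:
        ids = list(descriptions)
        sim = cosine_similarity(embed([descriptions[i] for i in ids]))
        np.fill_diagonal(sim, -1.0)
        a, b = np.unravel_index(sim.argmax(), sim.shape)
        i, j = ids[a], ids[b]
        verdict = llm(f"Do these describe the same goal? Answer yes or no.\n"
                      f"A: {descriptions[i]}\nB: {descriptions[j]}")
        if not verdict.strip().lower().startswith("yes"):
            break  # simplified stop; Algorithm 1 keeps checking remaining pairs
        clusters[i] += clusters.pop(j)
        descriptions[i] = llm("Describe the common goal of these dialogues:\n" + "\n".join(clusters[i][:10]))
        del descriptions[j]
    return clusters, descriptions
```

Note the paper's prompts also include negative samples for each cluster and continue scanning other pairs after a rejected merge; both are omitted here for brevity.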
Metric 2: Interaction Completeness Detection (Goal Completion)¶
Core Idea: A fine-tuned LLM learns the "completion distribution" and detects completeness by predicting whether a dialogue should terminate.
Formal definition: Given a distribution \(D\) of complete dialogues, a new distribution \(D'\) is constructed in which the final response of each complete dialogue is appended with an end token, i.e., each \(c = (p_1, r_1, \dots, p_n, r_n) \sim D\) becomes \((p_1, r_1, \dots, p_n, r_n, \text{end})\).
For a complete dialogue \(c\) and a truncated dialogue \(c'\) consisting of its first \(k < n\) turns, a model fit to \(D'\) is expected to satisfy \(P(\text{end} \mid c) \gg P(\text{end} \mid c')\): it should emit the end token after a complete dialogue but continue generating turns after an incomplete one.
Implementation:
- Base distribution (e.g., LMSYS): LLaMA3.1-8B-Instruct with a short prompt is used directly
- Domain-specific distribution (e.g., insurance underwriting, code debugging): A LoRA adapter is trained to fine-tune LLaMA3.1-8B as a completion model
- Input: \(\text{concat}(p_1, r_1, \dots, p_n)\)
- Target: \(r_n\) + end token
- Training: AdamW 8-bit, lr = 0.0002, weight decay = 0.01, 3 epochs, 50% of data
- Incomplete dialogues: The model does not output end but instead generates subsequent turns \(p_{n+1}, r_{n+1}, \dots\), which can additionally summarize the remaining tasks the LLM has not yet completed
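A minimal inference-time sketch of the end-token check, assuming a Hugging Face transformers/PEFT stack and a literal `<end_of_interaction>` string as the end marker; the model identifier, adapter path, marker string, and dialogue formatting are illustrative assumptions, not the paper's (unreleased) setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # assumed base model
ADAPTER_PATH = "path/to/completion-lora"          # hypothetical LoRA adapter trained on D'
END_MARKER = "<end_of_interaction>"               # assumed textual end token

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER_PATH)  # skip for the base-distribution (LMSYS) setting

def is_complete(dialogue_text, max_new_tokens=64):
    """Return True if the completion model predicts the interaction should terminate.

    dialogue_text: the dialogue so far, flattened as concat(p_1, r_1, ..., r_n).
    A complete dialogue should make the model emit END_MARKER almost immediately;
    an incomplete one makes it generate further turns instead.
    """
    inputs = tokenizer(dialogue_text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    continuation = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=False)
    return END_MARKER in continuation
```

For incomplete dialogues, the generated continuation itself can be inspected as a summary of what the system has not yet finished; only the binary end-marker check is shown here.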
Metric 3: Response Trees (Response Uncertainty)¶
Objective: Quantify LLM response uncertainty for a given prompt without repeated high-temperature sampling.
Response tree definition: Given a prompt \(p\) and a threshold probability \(\alpha\), \(\text{rtree}_{D,\alpha}(p)\) returns a tree of all branches whose traversal probability is \(\ge \alpha\).
Construction:
1. Generate one response along with its top-\(k\) log probabilities
2. Wherever the probability of any of the 2nd through \(k\)-th candidate tokens exceeds \(\alpha\), spawn a separate branch for each such token
3. Recurse until no candidate probability exceeds \(\alpha\) or a computational threshold is reached
Uncertainty quantification:
- Leaf node count: more leaves → more possible responses → higher uncertainty → greater likelihood of error
- Maximum log probability: higher values indicate greater model confidence in the optimal response
- Both metrics exhibit low correlation with dialogue length (\(r\) ranging from \(-0.25\) to \(0.41\)), suggesting that response trees capture more complex uncertainty than a purely length-dependent signal
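A depth-first sketch of this construction, assuming a local Hugging Face causal LM so that per-token distributions can be queried directly; the model name, threshold \(\alpha\), branching factor, and depth cap are illustrative assumptions, and the paper instead branches on the top-\(k\) log probabilities returned alongside a generated response:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def response_tree(prefix_ids, alpha=0.05, max_depth=64, top_k=5, logprob=0.0):
    """Expand all continuations of prefix_ids whose traversal probability stays >= alpha."""
    with torch.no_grad():
        logits = model(prefix_ids).logits[0, -1]
    token_logprobs = torch.log_softmax(logits, dim=-1)
    top = torch.topk(token_logprobs, top_k)

    children = {}
    for lp, token_id in zip(top.values.tolist(), top.indices.tolist()):
        branch_lp = logprob + lp
        if math.exp(branch_lp) < alpha:  # prune: traversal probability fell below alpha
            continue
        if token_id == tokenizer.eos_token_id or max_depth == 0:
            children[token_id] = {"logprob": branch_lp, "children": {}}  # leaf node
            continue
        next_ids = torch.cat(
            [prefix_ids, torch.tensor([[token_id]], device=prefix_ids.device)], dim=1
        )
        children[token_id] = response_tree(next_ids, alpha, max_depth - 1, top_k, branch_lp)
    return {"logprob": logprob, "children": children}

def tree_stats(node):
    """Leaf count and maximum leaf log probability: the two uncertainty signals above."""
    if not node["children"]:
        return 1, node["logprob"]
    leaves, best = 0, -math.inf
    for child in node["children"].values():
        n, lp = tree_stats(child)
        leaves, best = leaves + n, max(best, lp)
    return leaves, best

# Usage with an assumed prompt: count branches whose traversal probability stays >= 5%.
prompt_ids = tokenizer("Summarize the refund policy.", return_tensors="pt").input_ids.to(model.device)
n_leaves, max_lp = tree_stats(response_tree(prompt_ids))
```

Here `max_depth` and `top_k` play the role of the computational threshold that stops the recursion, and `tree_stats` yields the leaf count and maximum log probability used as uncertainty signals above.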
Key Experimental Results¶
Datasets¶
| Dataset | Size | Domain | Objective-Driven | Tool Use |
|---|---|---|---|---|
| LMSYS-Chat-1M | 1000 | Unstructured dialogue | ✗ | ✗ |
| Code-Feedback | 1000 | Code generation & debugging | ✓ | ✗ |
| Insurance | 380 | Insurance underwriting | ✓ | ✓ |
| WebShop | 351 | Online shopping interaction | ✓ | ✓ |
| SQL+OS+KB | 1043 | SQL / terminal / knowledge base | ✓ | ✓ |
Main Results¶
Completeness Detection Results¶
| Dataset (Evaluator) | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| LMSYS (70B judge) | 0.43 | 0.77 | 0.25 | 0.38 |
| LMSYS (8B completion) | 0.74 | 0.79 | 0.85 | 0.82 |
| Code-Feedback (70B judge) | 0.53 | 0.53 | 0.46 | 0.49 |
| Code-Feedback (FT 8B) | 0.47 | 0.71 | 0.12 | 0.21 |
| Insurance (70B judge) | 0.95 | 1.0 | 0.91 | 0.95 |
| Insurance (FT 8B) | 0.91 | 0.94 | 0.87 | 0.91 |
| WebShop (70B judge) | 0.92 | 1.0 | 0.83 | 0.91 |
| WebShop (FT 8B) | 0.92 | 0.89 | 1.0 | 0.94 |
| SQL+OS+KB (70B judge) | 0.97 | 0.96 | 0.97 | 0.96 |
| SQL+OS+KB (FT 8B) | 0.98 | 0.99 | 0.98 | 0.99 |
Key Findings¶
- The fine-tuned 8B model matches or surpasses the 70B LLM judge on most datasets
- The end token is a critical design choice (removing it on Insurance reduces F1 from 0.91 to 0.72)
- Code-Feedback presents the greatest difficulty due to loosely structured dialogues that can be continued at any point
Response Tree Statistics¶
| Metric | LMSYS | Code-Feedback | Insurance | WebShop | KB+OS+SQL |
|---|---|---|---|---|---|
| Max logprob vs. length | -0.11 | -0.19 | -0.25 | 0.16 | 0.41 |
| Max logprob vs. leaf count | -0.49 | -0.46 | -0.10 | -0.19 | -0.06 |
- KB+OS+SQL exhibits the highest uncertainty, as tool invocation, SQL, and terminal interactions diverge most from the base distribution
- LMSYS and Code-Feedback show the highest confidence, being closest to the pretraining distribution
Clustering Stability¶
- LMSYS, WebShop, and SQL+OS+KB produce highly stable clusters across runs
- Code-Feedback and Insurance show slightly lower stability (amenable to multi-dimensional labeling: language vs. task type)
- Compared to a GPT-4.1 LLM-only labeling baseline: the LLM-only approach degenerates to a single cluster ("Online Shopping and Purchase") on WebShop
Highlights & Insights¶
- Three complementary metrics providing complete coverage: goal identification (what) + completeness detection (whether) + uncertainty quantification (how confident), spanning the core dimensions of evaluation
- Zero annotations, zero references: Genuinely unsupervised, requiring neither ground truth nor an LLM judge
- Strong performance from a small model: An 8B fine-tuned model matches or exceeds a 70B judge, making online deployment and real-time monitoring feasible
- Response tree innovation: More structured and informationally richer than semantic entropy, which requires multiple high-temperature samples
- Distribution adaptability: LoRA fine-tuning adapts to domain-specific token distributions
Limitations & Future Work¶
- Clustering depends on the initial setting of \(k_1\), bounding the maximum number of discoverable clusters
- Completeness detection performs poorly on loosely structured dialogues (e.g., Code-Feedback, where the first turn may suffice and subsequent turns are follow-up questions)
- Multi-label classification is not supported (a single dialogue may involve multiple goals)
- Response trees lack ground-truth validation (no direct evidence that high uncertainty corresponds to errors)
- Limited fine-tuning data (Insurance uses only 190 training samples, leading to larger performance variance)
- Validation is confined to synthetic/public datasets, with no deployment testing in real enterprise systems
Related Work & Insights¶
| Method | Requires Labels | Requires References | Requires LLM Judge | Model Scale | Multi-Turn |
|---|---|---|---|---|---|
| ROUGE/BLEU | ✗ | ✓ | ✗ | — | ✗ |
| BERTScore | ✗ | ✓ | ✗ | ~110M | ✗ |
| HelpSteer | ✓ | ✗ | ✗ | — | ✓ |
| G-EVAL | ✗ | ✗ | ✓ | >70B | ✓ |
| DeepEval | ✗ | ✗ | ✓ | >70B | ✓ |
| Ours | ✗ | ✗ | ✗ | 8B | ✓ |
Additional insights:
- Online intervention potential: completeness detection can terminate unproductive dialogues early to conserve tokens; uncertainty quantification can trigger human escalation
- Response trees + sampling strategies: if the LLM's sampling strategy is known, response trees can provide statistical guarantees over output probabilities
- Complementarity with conformal prediction: this work targets the unsupervised setting, while conformal methods offer supervised guarantees; the two approaches are combinable
- LoRA fine-tuning as a distribution adapter: a general pattern of fine-tuning an 8B model with small-data LoRA to adapt it to domain-specific token distributions
Rating¶
- Novelty: ⭐⭐⭐⭐ — All three metrics offer individual contributions; LLM-guided clustering and response trees represent novel advances
- Experimental Thoroughness: ⭐⭐⭐⭐ — Six datasets and multiple ablations, though response trees lack direct effectiveness validation
- Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear and rigorous; appendix is comprehensive
- Value: ⭐⭐⭐⭐ — Fills a gap in unsupervised evaluation of multi-turn objective-driven dialogues with strong practical applicability