StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following¶
Conference: ACL 2025
arXiv: 2502.14494
Code: Available
Area: NLP / Dialogue Evaluation / Instruction Following
Keywords: Multi-turn dialogue, structural flow, instruction following, benchmark, dialogue structure modeling
TL;DR¶
Introduces StructFlowBench, a multi-turn instruction following benchmark integrating structural flow modeling, which defines six fundamental turn-to-turn relationships (Follow-up, Refinement, Recall, Summary, Expansion, Unrelatedness) and establishes a dual-layer constraint evaluation system (intra-turn constraints + inter-turn structural constraints) to systematically evaluate the capability of 13 major LLMs in understanding multi-turn dialogue structures.
Background & Motivation¶
Multi-turn instruction following is a core capability of LLMs in real-world scenarios, but existing evaluation methods suffer from three key limitations:
Inability to model complex scenarios: They simplify multi-turn dialogues into a linear concatenation of single-turn interactions, failing to capture logical coherence, user goal clarity, and natural transitions in real-world dialogues.
Methodological bias: Single-turn evaluation strategies dissect the structural connections between turns, ignoring multi-turn structural constraints.
Insufficient analysis: Existing methods overemphasize intra-turn constraint satisfaction, lacking a systematic framework to describe the conversational structural flow.
Key Insight: Multi-turn dialogue is not a simple concatenation of independent single turns—users demonstrate planning and intentionality in long dialogues, showing structural dependencies across turns. These dependencies are the key dimension distinguishing multi-turn interactions from single-turn ones, and serve as an indispensable second dimension in evaluation.
Method¶
Overall Architecture¶
StructFlowBench comprises two core components: 1. Six-class Structural Flow Taxonomy: Describing turn-to-turn relationships. 2. Dual-layer Constraint Evaluation System: Intra-turn constraints + Inter-turn structural constraints.
Key Designs¶
- Six-class Structural Flow Taxonomy (Structural Flow Taxonomy)
| Structure Type | Scope | Description |
|---|---|---|
| Follow-up | Adjacent turns | In-depth exploration based on the previous turn |
| Refinement | Adjacent turns | Modifying or clarifying the prompt from the previous turn |
| Recall | Long distance | Referencing content from two or more turns ago |
| Expansion | Multi-turn fan-out | Exploring multiple subtopics after introducing a topic |
| Summary | Multi-turn fan-in | Comprehensive overview integrating content from multiple turns |
| Unrelatedness | Any | A brand new topic, unrelated to previous turns |
Design Motivation: Patterns identified through analyzing real-world dialogue datasets such as WildChat and LMSYS-Chat-1M.
-
Dual-layer Constraint System
- Intra-turn constraints (8 categories): Negative constraints, style constraints, situational constraints, keyword/element constraints, basic formatting constraints, quantitative formatting constraints, template formatting constraints, content constraints.
- Inter-turn structural constraints (5 categories): Corresponding to the five structural relationships excluding Unrelatedness.
- Structural constraints ensure that the model maintains logical coherence across turns while satisfying single-turn requirements.
-
Data Generation Pipeline (Two-step Dialogue Generation)
- Parameter settings: Selecting task types (8 types), topics (22 types), user personas (expert/non-expert), and structural flow templates (14 hand-designed templates).
- First step: Generating intermediate dialogue plans (abstract-style prompts) leveraging GPT-4o with structural flow templates.
- Second step: Generating full dialogues (user prompts + LLM responses) based on the intermediate plans.
- Constraint extraction and addition: GPT-4o extracts intra-turn constraints + Structural constraints are added based on structural flow information.
- Scale: 155 multi-turn dialogues, 643 turns, 1775 constraints.
-
Evaluation Methodology
- Adopt the "Golden Context" method: using a curated dataset as the dialogue history instead of the contexts generated by models themselves.
- Evaluation based on constraint breakdown and binary questions: each instruction is broken down into multiple independent constraints \(\rightarrow\) binary questions (Yes/No) are designed for each constraint.
- Utilizing GPT-4o as the evaluator.
-
Evaluation Metrics
| Metric | Definition |
|---|---|
| CSR | Constraint Satisfaction Rate: the average ratio of satisfied constraints across all instructions |
| ISR | Instruction Satisfaction Rate: the ratio of instructions with all constraints satisfied |
| DRFR | Decomposed Requirement Following Rate: global constraint satisfaction ratio |
| WCSR (Newly proposed) | Weighted Constraint Satisfaction Rate: structural constraint weight \(w_s=2\), intra-turn constraint weight \(w_r=1\) |
Loss & Training¶
This work is an evaluation study and does not involve model training.
Key Experimental Results¶
Main Results (Evaluation of 13 LLMs)¶
| Model | follow-up | refinement | expansion | summary | recall | CSR | ISR | WCSR |
|---|---|---|---|---|---|---|---|---|
| DeepSeek-v3 | 0.99 | 0.80 | 0.92 | 1.00 | 1.00 | 0.97 | 0.93 | 0.96 |
| GPT-4o | 0.98 | 0.78 | 0.88 | 0.97 | 0.91 | 0.96 | 0.90 | 0.95 |
| Claude-3.5-Sonnet | 0.98 | 0.80 | 0.88 | 1.00 | 0.91 | 0.95 | 0.89 | 0.94 |
| Qwen2.5-7B | 0.95 | 0.76 | 0.90 | 0.94 | 0.97 | 0.93 | 0.84 | 0.92 |
| Llama-3.1-8B | 0.96 | 0.71 | 0.84 | 0.79 | 0.94 | 0.84 | 0.69 | 0.83 |
| DS-R1-Distill-Qwen-7B | 0.91 | 0.62 | 0.85 | 0.86 | 0.78 | 0.81 | 0.70 | 0.80 |
Difficulty Comparison of Structural Types¶
| Structural Type | Average Score of All Models | Difficulty Ranking |
|---|---|---|
| Summary | ~0.94 | Easiest |
| Follow-up | ~0.96 | Easy |
| Recall | ~0.92 | Medium |
| Expansion | ~0.87 | Harder |
| Refinement | ~0.73 | Hardest |
Key Findings¶
- Refinement is the greatest challenge: All models perform the worst on the refinement structure, suggesting that LLMs struggle to adapt their responses effectively to the user's intent to revise.
- Closed-source models lead overall: DeepSeek-v3, GPT-4o, and Claude-3.5-Sonnet achieve the best performance.
- Distilled reasoning models perform poorly: The DeepSeek-R1-Distill series lags significantly in structural understanding, potentially because the distillation process compromises structure-awareness.
- ISR is far lower than CSR: Indicating a substantial gap in the models' ability to satisfy all multiple constraints simultaneously.
- WCSR reflects real capability better than CSR: The weighted metric highlights the importance of structural constraints.
Highlights & Insights¶
- Pioneering Framework: Formulates a structural flow taxonomy for multi-turn dialogue for the first time, formalizing inter-turn relationships into six fundamental structures.
- Triple Functions: The structural flow taxonomy simultaneously serves structural diagnosis, intent inference, and controllable generation.
- WCSR Metric Design: Distinguishes structural constraints (higher importance, weight = 2) from intra-turn constraints (weight = 1) through weighting.
- Golden Context Evaluation Strategy: Utilizes standardized dialogue history to eliminate error accumulation in context.
- Scalable Generation Paradigm: 14 structural flow templates can be combined to generate diverse evaluation dialogues.
Limitations & Future Work¶
- The dataset size is relatively small (155 dialogues), which may not fully cover all combinations of structural patterns.
- Structural flow templates are hand-designed, potentially omitting certain structural patterns found in real dialogues.
- The evaluation heavily relies on GPT-4o as the evaluator, introducing evaluator bias.
- There are no structural constraints designed for the Unrelatedness structure.
- The impact of cultural and linguistic differences on dialogue structure is not considered.
- Compared to the average length of real-world dialogues (which could be longer), the average length of 4.14 turns might be relatively short.
Related Work & Insights¶
- MT-Bench (Zheng et al., 2023): Pioneered multi-turn dialogue evaluation but did not model structural relationships.
- MT-Eval (Kwan et al., 2024): Partially explored four multi-turn structures (Recall, Expansion, Refinement, Follow-up) but lacked a systematic framework.
- ComplexBench (Wen et al., 2024): Explored combinations of constraints in single-turn complex instructions.
- IFEval (Zhou et al., 2023): Foundational work in instruction following evaluation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Highly original design in structural flow taxonomy and dual-layer constraint evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐ In-depth analysis of 13 models, multi-dimensional metrics, and structural types.
- Writing Quality: ⭐⭐⭐⭐ Clear description of the taxonomy and rich diagrams.
- Value: ⭐⭐⭐⭐⭐ Opens up a new dimension of structured analysis for multi-turn dialogue evaluation.