StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following¶

Conference: ACL 2025
arXiv: 2502.14494
Code: Available
Area: NLP / Dialogue Evaluation / Instruction Following
Keywords: Multi-turn dialogue, structural flow, instruction following, benchmark, dialogue structure modeling

TL;DR¶

Introduces StructFlowBench, a multi-turn instruction following benchmark integrating structural flow modeling, which defines six fundamental turn-to-turn relationships (Follow-up, Refinement, Recall, Summary, Expansion, Unrelatedness) and establishes a dual-layer constraint evaluation system (intra-turn constraints + inter-turn structural constraints) to systematically evaluate the capability of 13 major LLMs in understanding multi-turn dialogue structures.

Background & Motivation¶

Multi-turn instruction following is a core capability of LLMs in real-world scenarios, but existing evaluation methods suffer from three key limitations:

Inability to model complex scenarios: They simplify multi-turn dialogues into a linear concatenation of single-turn interactions, failing to capture logical coherence, user goal clarity, and natural transitions in real-world dialogues.

Methodological bias: Single-turn evaluation strategies dissect the structural connections between turns, ignoring multi-turn structural constraints.

Insufficient analysis: Existing methods overemphasize intra-turn constraint satisfaction, lacking a systematic framework to describe the conversational structural flow.

Key Insight: Multi-turn dialogue is not a simple concatenation of independent single turns—users demonstrate planning and intentionality in long dialogues, showing structural dependencies across turns. These dependencies are the key dimension distinguishing multi-turn interactions from single-turn ones, and serve as an indispensable second dimension in evaluation.

Method¶

Overall Architecture¶

StructFlowBench comprises two core components: 1. Six-class Structural Flow Taxonomy: Describing turn-to-turn relationships. 2. Dual-layer Constraint Evaluation System: Intra-turn constraints + Inter-turn structural constraints.

Key Designs¶

Six-class Structural Flow Taxonomy (Structural Flow Taxonomy)

Structure Type	Scope	Description
Follow-up	Adjacent turns	In-depth exploration based on the previous turn
Refinement	Adjacent turns	Modifying or clarifying the prompt from the previous turn
Recall	Long distance	Referencing content from two or more turns ago
Expansion	Multi-turn fan-out	Exploring multiple subtopics after introducing a topic
Summary	Multi-turn fan-in	Comprehensive overview integrating content from multiple turns
Unrelatedness	Any	A brand new topic, unrelated to previous turns

Design Motivation: Patterns identified through analyzing real-world dialogue datasets such as WildChat and LMSYS-Chat-1M.

Dual-layer Constraint System
- Intra-turn constraints (8 categories): Negative constraints, style constraints, situational constraints, keyword/element constraints, basic formatting constraints, quantitative formatting constraints, template formatting constraints, content constraints.
- Inter-turn structural constraints (5 categories): Corresponding to the five structural relationships excluding Unrelatedness.
- Structural constraints ensure that the model maintains logical coherence across turns while satisfying single-turn requirements.
Data Generation Pipeline (Two-step Dialogue Generation)
- Parameter settings: Selecting task types (8 types), topics (22 types), user personas (expert/non-expert), and structural flow templates (14 hand-designed templates).
- First step: Generating intermediate dialogue plans (abstract-style prompts) leveraging GPT-4o with structural flow templates.
- Second step: Generating full dialogues (user prompts + LLM responses) based on the intermediate plans.
- Constraint extraction and addition: GPT-4o extracts intra-turn constraints + Structural constraints are added based on structural flow information.
- Scale: 155 multi-turn dialogues, 643 turns, 1775 constraints.
Evaluation Methodology
- Adopt the "Golden Context" method: using a curated dataset as the dialogue history instead of the contexts generated by models themselves.
- Evaluation based on constraint breakdown and binary questions: each instruction is broken down into multiple independent constraints \(\rightarrow\) binary questions (Yes/No) are designed for each constraint.
- Utilizing GPT-4o as the evaluator.
Evaluation Metrics

Metric	Definition
CSR	Constraint Satisfaction Rate: the average ratio of satisfied constraints across all instructions
ISR	Instruction Satisfaction Rate: the ratio of instructions with all constraints satisfied
DRFR	Decomposed Requirement Following Rate: global constraint satisfaction ratio
WCSR (Newly proposed)	Weighted Constraint Satisfaction Rate: structural constraint weight \(w_s=2\), intra-turn constraint weight \(w_r=1\)

Loss & Training¶

This work is an evaluation study and does not involve model training.

Key Experimental Results¶

Main Results (Evaluation of 13 LLMs)¶

Model	follow-up	refinement	expansion	summary	recall	CSR	ISR	WCSR
DeepSeek-v3	0.99	0.80	0.92	1.00	1.00	0.97	0.93	0.96
GPT-4o	0.98	0.78	0.88	0.97	0.91	0.96	0.90	0.95
Claude-3.5-Sonnet	0.98	0.80	0.88	1.00	0.91	0.95	0.89	0.94
Qwen2.5-7B	0.95	0.76	0.90	0.94	0.97	0.93	0.84	0.92
Llama-3.1-8B	0.96	0.71	0.84	0.79	0.94	0.84	0.69	0.83
DS-R1-Distill-Qwen-7B	0.91	0.62	0.85	0.86	0.78	0.81	0.70	0.80

Difficulty Comparison of Structural Types¶

Structural Type	Average Score of All Models	Difficulty Ranking
Summary	~0.94	Easiest
Follow-up	~0.96	Easy
Recall	~0.92	Medium
Expansion	~0.87	Harder
Refinement	~0.73	Hardest

Key Findings¶

Refinement is the greatest challenge: All models perform the worst on the refinement structure, suggesting that LLMs struggle to adapt their responses effectively to the user's intent to revise.
Closed-source models lead overall: DeepSeek-v3, GPT-4o, and Claude-3.5-Sonnet achieve the best performance.
Distilled reasoning models perform poorly: The DeepSeek-R1-Distill series lags significantly in structural understanding, potentially because the distillation process compromises structure-awareness.
ISR is far lower than CSR: Indicating a substantial gap in the models' ability to satisfy all multiple constraints simultaneously.
WCSR reflects real capability better than CSR: The weighted metric highlights the importance of structural constraints.

Highlights & Insights¶

Pioneering Framework: Formulates a structural flow taxonomy for multi-turn dialogue for the first time, formalizing inter-turn relationships into six fundamental structures.
Triple Functions: The structural flow taxonomy simultaneously serves structural diagnosis, intent inference, and controllable generation.
WCSR Metric Design: Distinguishes structural constraints (higher importance, weight = 2) from intra-turn constraints (weight = 1) through weighting.
Golden Context Evaluation Strategy: Utilizes standardized dialogue history to eliminate error accumulation in context.
Scalable Generation Paradigm: 14 structural flow templates can be combined to generate diverse evaluation dialogues.

Limitations & Future Work¶

The dataset size is relatively small (155 dialogues), which may not fully cover all combinations of structural patterns.
Structural flow templates are hand-designed, potentially omitting certain structural patterns found in real dialogues.
The evaluation heavily relies on GPT-4o as the evaluator, introducing evaluator bias.
There are no structural constraints designed for the Unrelatedness structure.
The impact of cultural and linguistic differences on dialogue structure is not considered.
Compared to the average length of real-world dialogues (which could be longer), the average length of 4.14 turns might be relatively short.

MT-Bench (Zheng et al., 2023): Pioneered multi-turn dialogue evaluation but did not model structural relationships.
MT-Eval (Kwan et al., 2024): Partially explored four multi-turn structures (Recall, Expansion, Refinement, Follow-up) but lacked a systematic framework.
ComplexBench (Wen et al., 2024): Explored combinations of constraints in single-turn complex instructions.
IFEval (Zhou et al., 2023): Foundational work in instruction following evaluation.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Highly original design in structural flow taxonomy and dual-layer constraint evaluation.
Experimental Thoroughness: ⭐⭐⭐⭐ In-depth analysis of 13 models, multi-dimensional metrics, and structural types.
Writing Quality: ⭐⭐⭐⭐ Clear description of the taxonomy and rich diagrams.
Value: ⭐⭐⭐⭐⭐ Opens up a new dimension of structured analysis for multi-turn dialogue evaluation.