# Automating Complex Document Workflows via Stepwise and Rollback-Enabled Operations
**Conference:** AAAI 2026 · **arXiv:** 2512.04445 · **Code:** None · **Area:** LLM Agents · **Keywords:** document workflows, multi-step operations, rollback mechanism, error recovery, AutoDW
## TL;DR
This paper proposes AutoDW, a framework that automates complex document workflows through stepwise planning (generating one API call at a time) combined with adaptive rollback (parameter-level and API-level). On DWBench, a benchmark of 250 sessions and 1,708 instructions, AutoDW achieves 90% instruction-level and 62% session-level completion rates, surpassing the strongest baseline by 40% and 76% relative, respectively.
## Background & Motivation
- Background: LLMs have demonstrated automation capabilities in code generation, data science, and web tasks, yet long-chain workflow automation for document processing (Word editing, format conversion, etc.) remains challenging. Existing document agents perform poorly—e.g., GPT-4 in PPTC achieves only a 6% session completion rate.
- Limitations of Prior Work: Real-world document workflows involve multi-step, interdependent instructions (e.g., "set table header → fill second row → merge cells"). Existing agents generate all API calls at once via a predefined plan without adapting to evolving document states, causing cascading failures whenever a single step errs.
- Key Challenge: A fundamental gap exists between the ambiguity of natural-language instructions (e.g., "add a header" may correspond to multiple APIs) and the precision required by document operations (API parameters must exactly match the current document state).
- Key Insight: Decompose workflows into atomic operations executed and verified step by step, coupled with a two-layer rollback mechanism for automatic error correction to prevent error cascades.
## Method
### Overall Architecture
AutoDW consists of three core modules: (1) Stepwise Planning—generating one sub-instruction and its corresponding API call at a time; (2) API Execution & State Tracking—executing APIs in a Python runtime and extracting the document state; (3) Adaptive Rollback—verifying whether execution results align with user intent and triggering parameter-level or API-level rollback upon mismatch.
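The three-module loop can be sketched as below. Since no code is released (Code: None), every function name and data shape here is a hypothetical stand-in: the document is reduced to a flat dict and each "API call" sets one field, purely to illustrate the execute → verify → rollback control flow.

```python
def stepwise_loop(steps, execute, verify, retune, reselect):
    """Run one sub-instruction at a time; roll back on verifier mismatch."""
    doc = {}
    for sub in steps:
        before = dict(doc)                        # snapshot for rollback
        doc = execute(sub["call"], dict(before))  # API execution
        ok, why = verify(sub, doc)                # alignment verification
        if not ok:
            # Parameter-level rollback: same API, corrected parameters.
            doc = execute(retune(sub["call"], why, sub), dict(before))
            ok, why = verify(sub, doc)
        if not ok:
            # API-level rollback: reselect the API from scratch.
            doc = execute(reselect(sub), dict(before))
    return doc

# Toy instantiation: a "document" is a dict; an "API call" sets one field.
def execute(call, doc):
    doc[call["field"]] = call["value"]
    return doc

def verify(sub, doc):
    ok = doc.get(sub["field"]) == sub["want"]
    return ok, "" if ok else "wrong value for " + sub["field"]

def retune(call, why, sub):        # fix parameters, keep the API
    return {"field": call["field"], "value": sub["want"]}

def reselect(sub):                 # pick a fresh API call
    return {"field": sub["field"], "value": sub["want"]}

steps = [
    {"field": "header", "want": "Q3 Report",
     "call": {"field": "header", "value": "Q3 Reprot"}},  # bad param -> rollback
    {"field": "rows", "want": 2, "call": {"field": "rows", "value": 2}},
]
final = stepwise_loop(steps, execute, verify, retune, reselect)
print(final)  # {'header': 'Q3 Report', 'rows': 2}
```

In this sketch the verifier is a deterministic predicate; in AutoDW it is an LLM judging semantic alignment, but the two-layer escalation order (parameters first, then API reselection) is the same.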
### Key Designs
- **Stepwise Planning**
  - Two-stage generation: first decompose the user instruction into atomic sub-instructions (each completable by a single API call), then generate the specific API call.
  - Sub-instructions bridge the semantic gap between natural language and API functionality while enabling intent classification to narrow the API search space.
  - Intent classification: a fine-tuned 178M BERT model classifies instructions into 8 categories (content creation / modification / table / image / chart / formatting / document structure / document lifecycle), achieving 98% test accuracy.
  - The top-3 intents are retained rather than the top-1, improving robustness to ambiguous instructions.
- **Document State Tracking**
  - Document state is modeled as a 7-tuple: document metadata, paragraph elements, table elements, image elements, page layout, interactive elements, and document styles.
  - After each API execution, the complete document state is programmatically extracted, providing precise change descriptions for subsequent verification.
  - State parsing failures are treated as invalid executions and trigger API-level rollback, preventing further planning on top of erroneous states.
- **Adaptive Rollback**
  - Change analysis: compares document states before and after execution, detecting changes across six dimensions: structure, content, formatting, style, tables, and hyperlinks.
  - Alignment verification: an LLM verifier assesses whether the state changes align with the sub-instruction, returning a binary decision, a confidence score, and an explanation.
  - Parameter-level rollback: retains the selected API but updates its parameters based on the verifier's explanation.
  - API-level rollback: reselects the API entirely; triggered when parameter-level rollback also fails.
  - Single-round rollback (parameter-level, then API-level) is the default; experiments confirm diminishing marginal returns beyond one round.
- **DWBench Benchmark Construction**
  - 250 multi-turn sessions, 1,708 manually annotated instructions, and 74 APIs.
  - An average of 34.8 API calls per session (range: 15-75) and 5.1 API calls per instruction.
  - Correctness metric: an LLM judge evaluates semantic equivalence between the post-execution document state and the ground-truth state.
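The change-analysis step can be sketched as a per-dimension diff between two state snapshots. This assumes a flat dict keyed by the six change dimensions; the paper's actual 7-tuple extraction is far richer, so this is only an illustration of the comparison logic.

```python
# The six change dimensions named in the paper; the snapshot schema is assumed.
DIMENSIONS = ("structure", "content", "formatting", "style", "tables", "hyperlinks")

def diff_states(before, after):
    """Return a per-dimension report of what changed between two snapshots."""
    changes = {}
    for dim in DIMENSIONS:
        old, new = before.get(dim), after.get(dim)
        if old != new:
            changes[dim] = {"before": old, "after": new}
    return changes

before = {"content": ["Intro"], "tables": 0}
after  = {"content": ["Intro"], "tables": 1, "style": "Heading 1"}
print(diff_states(before, after))
# {'style': {'before': None, 'after': 'Heading 1'}, 'tables': {'before': 0, 'after': 1}}
```

A report like this is what the LLM verifier would consume when judging whether the observed changes match the sub-instruction's intent.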
## Loss & Training
- The BERT intent classifier is fine-tuned on 3,315 instruction–intent pairs with no overlap with DWBench.
- The verifier confidence threshold of 0.6 is determined via sensitivity analysis as the optimal point balancing false negatives and false positives.
- The rollback strategy requires no additional training—it relies entirely on the LLM's reasoning capability.
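The 0.6 confidence threshold can be read as a simple acceptance gate on the verifier's output. The exact gating rule is not published, so treating it as a conjunction of the binary verdict and the confidence cutoff is an assumption:

```python
def accept_execution(aligned, confidence, threshold=0.6):
    """Accept a step only when the LLM verifier both judges the change
    aligned with the sub-instruction and is sufficiently confident.
    A rejected step would trigger parameter-level rollback first."""
    return bool(aligned) and confidence >= threshold

print(accept_execution(True, 0.85))   # accepted
print(accept_execution(True, 0.55))   # low confidence -> rollback
print(accept_execution(False, 0.90))  # verifier says misaligned -> rollback
```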
## Key Experimental Results
### Main Results
| Method | Instruction-Level Acc. (%) | Session-Level Acc. (%) | Avg. API Calls / Instruction | Token Usage |
|---|---|---|---|---|
| Retrieval-only | 13.84% | 4.40% | 4.82 | 29.6k |
| Reasoning-only | 39.93% | 25.20% | 5.12 | 31.6k |
| Hybrid (PPTC) | 64.46% | 35.20% | 5.30 | 36.5k |
| AutoDW | 90.33% | 62.00% | 5.21 | 42.8k |
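A quick arithmetic check of the relative gains over the strongest baseline, Hybrid (PPTC), using the numbers in the table above:

```python
hybrid = {"instr": 64.46, "session": 35.20}   # Hybrid (PPTC) baseline
autodw = {"instr": 90.33, "session": 62.00}   # AutoDW

# Relative improvement in percent, rounded to one decimal place.
rel = {k: round(100 * (autodw[k] / hybrid[k] - 1), 1) for k in hybrid}
print(rel)  # {'instr': 40.1, 'session': 76.1} -> the ~40% / ~76% relative gains
```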
### Ablation: LLM Backbones
| LLM Backbone | Instruction-Level Acc. (%) | Session-Level Acc. (%) | Easy / Medium / Hard Acc. (%) |
|---|---|---|---|
| Qwen-Plus | 82.82% | 53.60% | 86.3 / 83.1 / 79.0 |
| DeepSeek-v3 | 90.33% | 62.00% | 94.5 / 90.0 / 86.3 |
| Gemini-2.5-Pro | Among best | Among best | High / High / High |
| GPT-4.1 | Among best | Among best | High / High / High |
### Key Findings
- 76% improvement in session-level completion: from 35.2% (Hybrid) to 62.0% (AutoDW), at the cost of only 25.6% additional token usage.
- Hard tasks (>6 APIs) trail the overall average by only 4.4 percentage points, demonstrating AutoDW's stability on long-chain complex tasks.
- Strong cross-LLM robustness: all four LLM backbones perform well; even the weakest, Qwen-Plus, achieves 82.8% instruction-level accuracy.
- Cost-effectiveness of rollback: single-round two-layer rollback is the optimal strategy; additional rounds yield diminishing returns.
- ~60% of rollbacks occur at format-conversion steps: document format handling remains a weak point for LLMs.
## Highlights & Insights
- Generality of the "stepwise + rollback" paradigm: beyond document automation, this paradigm is transferable to any multi-step execution task such as code generation and data pipelines.
- Completeness of the 7-tuple document state representation: precise state tracking is the foundation of the rollback mechanism—accurate verification is impossible without accurate state.
- Efficiency of the 178M BERT intent classifier: the design principle of delegating fixed classification to a small model and flexible reasoning to a large model is worth emulating.
## Limitations & Future Work
- Currently limited to Word documents (.docx); Excel, PowerPoint, PDF, and other formats are not covered.
- The 74 APIs cover common operations but fall far short of the full complexity of real-world Office APIs.
- The verifier's confidence threshold calibration relies on empirical tuning; adaptive thresholding warrants exploration.
- The session-level completion rate of 62%, while substantially ahead of baselines, still leaves considerable room for improvement.
## Related Work & Insights
- vs. PPTC (PPT automation baseline): PPTC uses a predefined plan with a rule-based mapper and has no error recovery capability; AutoDW's stepwise planning and adaptive rollback yield robust performance across varying task complexity.
- vs. DocPilot / TableTalk (human-in-the-loop): these systems rely on human verification at each step; AutoDW replaces human validation with an LLM verifier to achieve full automation.
## Rating
| Dimension | Score | Rationale |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ | Stepwise planning combined with two-layer rollback is a novel and practical design for document agents. |
| Technical Depth | ⭐⭐⭐⭐ | The 7-tuple state tracking, 6-dimensional change analysis, and two-layer rollback constitute a complete system design. |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ | Large-scale benchmark of 250 sessions, 4 LLM backbones, difficulty gradients, and ablation studies. |
| Value | ⭐⭐⭐⭐⭐ | Directly addresses practical pain points in office automation; a 90% instruction completion rate approaches production readiness. |