SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction

Conference: NeurIPS 2025 arXiv: 2509.00581 Code: None Area: LLM Reasoning / NLP Keywords: text-to-SQL, multi-agent, error taxonomy, chain-of-thought, Spider benchmark

TL;DR

This paper proposes SQL-of-Thought, a multi-agent Text-to-SQL framework that decomposes the task into schema linking → subproblem identification → CoT query plan generation → SQL generation → guided correction loop based on a 31-category error taxonomy. Using Claude 3 Opus on the Spider benchmark, it achieves 91.59% execution accuracy, outperforming the previous best Chase SQL (87.6%) by nearly 4 percentage points.

Background & Motivation

Background: Text-to-SQL has evolved from sequence-to-sequence models to LLM prompting methods (DIN-SQL, DAIL-SQL), with multi-agent approaches (MAC-SQL, Chase SQL) further improving modularity and accuracy.

Limitations of Prior Work: (a) Existing error correction relies solely on execution feedback — 95–99% of generated SQL is syntactically correct, yet logical errors (e.g., wrong JOIN types, missing aggregations) cannot be detected via execution signals; (b) unguided reasoning may introduce new errors; (c) there is no systematic error taxonomy to guide correction.

Key Challenge: Syntactic correctness ≠ semantic correctness — structured error diagnosis beyond execution feedback is required.

Key Insight: Design a 31-category error taxonomy, combined with CoT reasoning in the correction loop to precisely locate and repair logical errors.
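The "syntactic ≠ semantic" gap is easy to reproduce. In the toy sqlite3 sketch below (schema and data are invented for illustration), two queries both execute without error, but the INNER JOIN variant silently drops a department — exactly the class of logical error (a JOIN-type mistake) that execution feedback alone cannot flag:

```python
import sqlite3

# Hypothetical toy schema: both queries below are syntactically valid and
# execute cleanly, yet only one answers the question correctly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp  (id INTEGER PRIMARY KEY, dept_id INTEGER, salary REAL);
    INSERT INTO dept VALUES (1, 'Sales'), (2, 'Research');
    INSERT INTO emp  VALUES (1, 1, 50000.0);  -- Research has no employees
""")

question = "List every department and its employee count."

# Logically wrong: INNER JOIN silently drops departments with zero employees.
wrong = conn.execute("""
    SELECT d.name, COUNT(e.id) FROM dept d
    JOIN emp e ON e.dept_id = d.id GROUP BY d.name
""").fetchall()

# Correct: LEFT JOIN keeps all departments (sorted for a stable ordering).
right = sorted(conn.execute("""
    SELECT d.name, COUNT(e.id) FROM dept d
    LEFT JOIN emp e ON e.dept_id = d.id GROUP BY d.name
""").fetchall())

print(wrong)  # [('Sales', 1)] — 'Research' is missing, yet no execution error
print(right)  # [('Research', 0), ('Sales', 1)]
```

Both runs return a result set, so an execution-feedback-only corrector sees nothing to fix; a taxonomy-guided corrector can name the error ("wrong JOIN type") and repair it.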

Method

Overall Architecture

Five specialized agents execute sequentially with a guided correction loop: Schema Linking → Subproblem → Query Plan (CoT) → SQL Generation → execution test → [on failure] Correction Plan (CoT + error taxonomy) → Correction SQL → re-execution.
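The control flow above can be sketched as follows. In the paper each agent is an LLM prompt; here they are placeholder callables, and `MAX_ATTEMPTS` is an assumed retry budget standing in for the paper's cap on correction attempts:

```python
MAX_ATTEMPTS = 3  # assumed budget; the paper bounds the correction loop

def run_pipeline(question, schema, execute, agents):
    """Sketch of SQL-of-Thought's sequential agents + guided correction loop.
    `agents` maps stage names to callables; `execute(sql) -> (ok, result)`."""
    links = agents["schema_link"](question, schema)   # tables/columns/keys
    subs  = agents["subproblem"](question, links)     # clause-level JSON
    plan  = agents["query_plan"](question, subs)      # CoT plan, no SQL yet
    sql   = agents["sql"](plan, links)
    for _ in range(MAX_ATTEMPTS):
        ok, result = execute(sql)
        if ok:
            return sql, result
        # Guided correction is two-step: diagnose against the error
        # taxonomy and plan the repair, then regenerate SQL from the plan.
        fix_plan = agents["correction_plan"](sql, result, plan)
        sql      = agents["correction_sql"](sql, fix_plan)
    return sql, None
```

The key structural choice is that correction re-enters through a planning step rather than feeding raw execution feedback straight back to the SQL agent.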

Key Designs

  1. Staged Reasoning:

    • Schema Linking Agent: identifies relevant tables, columns, and primary/foreign keys.
    • Subproblem Agent: decomposes the query into clause-level subproblems (WHERE/GROUP BY/JOIN, etc.) and outputs structured JSON.
    • Query Plan Agent: generates a step-by-step execution plan via CoT (SQL generation is prohibited at this stage).
    • SQL Agent: generates executable SQL based on the plan.
  2. 31-Category Error Taxonomy:

    • Covers 9 major categories: syntax errors, schema linking errors, JOIN errors, filter condition errors, aggregation logic errors, value representation errors, subquery errors, set operation errors, and structural omissions.
    • Uses concise error codes (rather than verbose descriptions) to conserve context window space.
    • The Correction Plan Agent is prompted to reference the taxonomy, diagnosing the error type before formulating a repair strategy.
  3. Guided Correction Loop:

    • Unlike DIN-SQL (regeneration only) and DAIL-SQL (execution feedback only), the correction loop provides error type + CoT repair plan.
    • A two-step correction process (Correction Plan Agent → Correction SQL Agent) is used instead of direct correction.
    • The loop continues until execution succeeds or the maximum number of attempts is reached.
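A taxonomy of concise codes can be wired into the correction prompt roughly as below. The paper's exact 31 categories are not reproduced here; the codes are hypothetical examples under its nine major groups, and `correction_prompt` is an illustrative helper, not the paper's prompt:

```python
# Illustrative slice of an error taxonomy as short codes (hypothetical
# entries; short codes keep the full taxonomy cheap to include per call).
ERROR_TAXONOMY = {
    "SYN-1":  "syntax: malformed clause",
    "SCH-1":  "schema linking: wrong table",
    "SCH-2":  "schema linking: wrong column",
    "JOIN-1": "join: wrong join type (INNER vs LEFT)",
    "JOIN-2": "join: missing or incorrect join condition",
    "FIL-1":  "filter: wrong predicate or operator",
    "AGG-1":  "aggregation: missing GROUP BY / HAVING",
    "VAL-1":  "value: wrong literal format or casing",
    "SUB-1":  "subquery: incorrect nesting or correlation",
    "SET-1":  "set operation: UNION/INTERSECT misuse",
    "STR-1":  "structure: missing clause (ORDER BY, LIMIT)",
}

def correction_prompt(sql, feedback):
    """Build a Correction Plan prompt: diagnose first, plan second,
    and defer SQL generation to the Correction SQL agent."""
    codes = "\n".join(f"{c}: {d}" for c, d in ERROR_TAXONOMY.items())
    return (
        f"Failing SQL:\n{sql}\n\nFeedback:\n{feedback}\n\n"
        f"Error taxonomy:\n{codes}\n\n"
        "First name the matching error code(s), then write a step-by-step "
        "repair plan. Do not write SQL yet."
    )
```

Forcing the model to name an error code before planning is what turns "blind retry" into targeted repair.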

Key Experimental Results

Spider Benchmark

| Method | Spider EA | Spider-Realistic EA |
|---|---|---|
| DIN-SQL + GPT-4 | 82.8% | 78.1% |
| DAIL-SQL + GPT-4 + SC | 83.6% | 75.2% |
| MAC-SQL + GPT-4 | 86.8% | - |
| Tool-SQL + GPT-4 | 86.9% | 82.9% |
| Chase SQL | 87.6% | - |
| SQL-of-Thought + Claude 3 Opus | 91.59% | 90.16% |

Ablation Study (100 samples)

| Configuration | Accuracy | Note |
|---|---|---|
| SQL-of-Thought (full) | 95% | Complete framework |
| w/o correction loop | 85% | −10%; correction is critical |
| w/o Query Plan | 90% | −5%; CoT planning is beneficial |

Model Comparison

| Model | SQL-of-Thought EA |
|---|---|
| Claude 3 Opus | 95% |
| GPT-5 | 89% |
| GPT-4o-mini | 87% |
| GPT-3.5 | 67% |
| Llama-3.1-8B | ~45% |

Key Findings

  • Correction loop contributes +10%: SQL that is syntactically correct but logically erroneous requires structured diagnosis to repair.
  • CoT Query Plan contributes +5%: Planning before generating SQL is more reliable than direct generation.
  • Claude 3 Opus performs best: It demonstrates the strongest reasoning capability and SQL generation accuracy among all evaluated models.
  • Cost trade-off: A single Spider run costs ~$42; a hybrid model strategy can reduce this to ~$30 (at 85% EA).

Highlights & Insights

  • Value of the error taxonomy: The structured 31-category taxonomy transforms LLM behavior from "blind retry" to "targeted repair" — removing the taxonomy exacerbates repeated corrections of identical errors.
  • Two-step correction > direct correction: Generating a correction plan before generating SQL is more effective than directly feeding error information to the SQL Agent — LLMs benefit from structured intermediate reasoning steps.
  • Lessons from failed ablations: Assigning multiple repair agents to each fix a different error type and then merging results leads to conflicts; carrying correction history causes context bloat and performance degradation.
  • Large gap for open-source models: Llama-3.1-8B achieves only ~45%, exposing the limitations of smaller models on complex structured generation tasks.

Limitations & Future Work

  • Evaluation limited to the Spider series: Spider does not reflect the complexity of real-world databases (the TAG framework indicates it covers only ~20% of real queries).
  • High API cost: The multi-agent framework costs ~$42 per run, limiting practical deployment.
  • Error taxonomy coverage unverified: Coverage across diverse query structures remains unknown.
  • Future directions: (1) Evaluation on BIRD-SQL and real-world databases; (2) fine-tuning smaller models to replace API calls and reduce cost; (3) automatic learning and updating of the error taxonomy.

Comparison with Prior Work

  • vs. DIN-SQL: DIN-SQL's correction merely regenerates the prompt without specific error information; SQL-of-Thought provides taxonomy-guided, targeted correction.
  • vs. Chase SQL: Chase SQL employs a multi-candidate selection strategy, whereas SQL-of-Thought uses a single-path iterative correction approach — making it more efficient.
  • vs. Think2SQL: Think2SQL finds that reasoning offers mixed benefits for SQL generation; SQL-of-Thought eliminates the "unguided reasoning can be harmful" problem through staged reasoning (plan → SQL) and taxonomy-guided correction.
  • Insight: For structured output generation (SQL/code), a "diagnose → plan → repair" paradigm is more effective than "detect → retry."

Rating

  • Novelty: ⭐⭐⭐⭐ Taxonomy-guided correction is a novel and practical design.
  • Experimental Thoroughness: ⭐⭐⭐ Limited to the Spider series; lacks evaluation on BIRD-SQL and real-world databases.
  • Writing Quality: ⭐⭐⭐⭐ Architecture is clearly presented; ablation analysis is thorough, including lessons from failed ablations.
  • Value: ⭐⭐⭐⭐ Spider SOTA results are convincing, but generalization to harder benchmarks requires further validation.