SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction

Conference: NeurIPS 2025 arXiv: 2509.00581 Code: None Area: LLM Reasoning / NLP Keywords: text-to-SQL, multi-agent, error taxonomy, chain-of-thought, Spider benchmark

TL;DR

This paper proposes SQL-of-Thought, a multi-agent Text-to-SQL framework that decomposes the task into schema linking → subproblem identification → CoT query plan generation → SQL generation → guided correction loop based on a 31-category error taxonomy. Using Claude 3 Opus on the Spider benchmark, it achieves 91.59% execution accuracy, outperforming the previous best Chase SQL (87.6%) by nearly 4 percentage points.

Background & Motivation

Background: Text-to-SQL has evolved from sequence-to-sequence models to LLM prompting methods (DIN-SQL, DAIL-SQL), with multi-agent approaches (MAC-SQL, Chase SQL) further improving modularity and accuracy.

Limitations of Prior Work: (a) Existing error correction relies solely on execution feedback — 95–99% of generated SQL is syntactically correct, yet logical errors (e.g., wrong JOIN types, missing aggregations) cannot be detected via execution signals; (b) unguided reasoning may introduce new errors; (c) there is no systematic error taxonomy to guide correction.

Key Challenge: Syntactic correctness ≠ semantic correctness — structured error diagnosis beyond execution feedback is required.

Key Insight: Design a 31-category error taxonomy, combined with CoT reasoning in the correction loop to precisely locate and repair logical errors.
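The "syntactic ≠ semantic" gap is easy to reproduce. In the toy sqlite3 sketch below (schema and data are invented for illustration), two queries both execute without error, but the INNER JOIN variant silently drops a department — exactly the class of logical error (a JOIN-type mistake) that execution feedback alone cannot flag:

```python
import sqlite3

# Hypothetical toy schema: both queries below are syntactically valid and
# execute cleanly, yet only one answers the question correctly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp  (id INTEGER PRIMARY KEY, dept_id INTEGER, salary REAL);
    INSERT INTO dept VALUES (1, 'Sales'), (2, 'Research');
    INSERT INTO emp  VALUES (1, 1, 50000.0);  -- Research has no employees
""")

question = "List every department and its employee count."

# Logically wrong: INNER JOIN silently drops departments with zero employees.
wrong = conn.execute("""
    SELECT d.name, COUNT(e.id) FROM dept d
    JOIN emp e ON e.dept_id = d.id GROUP BY d.name
""").fetchall()

# Correct: LEFT JOIN keeps all departments (sorted for a stable ordering).
right = sorted(conn.execute("""
    SELECT d.name, COUNT(e.id) FROM dept d
    LEFT JOIN emp e ON e.dept_id = d.id GROUP BY d.name
""").fetchall())

print(wrong)  # [('Sales', 1)] — 'Research' is missing, yet no execution error
print(right)  # [('Research', 0), ('Sales', 1)]
```

Both runs return a result set, so an execution-feedback-only corrector sees nothing to fix; a taxonomy-guided corrector can name the error ("wrong JOIN type") and repair it.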

Method

Overall Architecture

Five specialized agents execute sequentially with a guided correction loop: Schema Linking → Subproblem → Query Plan (CoT) → SQL Generation → execution test → [on failure] Correction Plan (CoT + error taxonomy) → Correction SQL → re-execution.
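The control flow above can be sketched as follows. In the paper each agent is an LLM prompt; here they are placeholder callables, and `MAX_ATTEMPTS` is an assumed retry budget standing in for the paper's cap on correction attempts:

```python
MAX_ATTEMPTS = 3  # assumed budget; the paper bounds the correction loop

def run_pipeline(question, schema, execute, agents):
    """Sketch of SQL-of-Thought's sequential agents + guided correction loop.
    `agents` maps stage names to callables; `execute(sql) -> (ok, result)`."""
    links = agents["schema_link"](question, schema)   # tables/columns/keys
    subs  = agents["subproblem"](question, links)     # clause-level JSON
    plan  = agents["query_plan"](question, subs)      # CoT plan, no SQL yet
    sql   = agents["sql"](plan, links)
    for _ in range(MAX_ATTEMPTS):
        ok, result = execute(sql)
        if ok:
            return sql, result
        # Guided correction is two-step: diagnose against the error
        # taxonomy and plan the repair, then regenerate SQL from the plan.
        fix_plan = agents["correction_plan"](sql, result, plan)
        sql      = agents["correction_sql"](sql, fix_plan)
    return sql, None
```

The key structural choice is that correction re-enters through a planning step rather than feeding raw execution feedback straight back to the SQL agent.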

Key Designs

  1. Staged Reasoning:

    • Schema Linking Agent: identifies relevant tables, columns, and primary/foreign keys.
    • Subproblem Agent: decomposes the query into clause-level subproblems (WHERE/GROUP BY/JOIN, etc.) and outputs structured JSON.
    • Query Plan Agent: generates a step-by-step execution plan via CoT (SQL generation is prohibited at this stage).
    • SQL Agent: generates executable SQL based on the plan.
  2. 31-Category Error Taxonomy:

    • Covers 9 major categories: syntax errors, schema linking errors, JOIN errors, filter condition errors, aggregation logic errors, value representation errors, subquery errors, set operation errors, and structural omissions.
    • Uses concise error codes (rather than verbose descriptions) to conserve context window space.
    • The Correction Plan Agent is prompted to reference the taxonomy, diagnosing the error type before formulating a repair strategy.
  3. Guided Correction Loop:

    • Unlike DIN-SQL (regeneration only) and DAIL-SQL (execution feedback only), the correction loop provides error type + CoT repair plan.
    • A two-step correction process (Correction Plan Agent → Correction SQL Agent) is used instead of direct correction.
    • The loop continues until execution succeeds or the maximum number of attempts is reached.
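A taxonomy of concise codes can be wired into the correction prompt roughly as below. The paper's exact 31 categories are not reproduced here; the codes are hypothetical examples under its nine major groups, and `correction_prompt` is an illustrative helper, not the paper's prompt:

```python
# Illustrative slice of an error taxonomy as short codes (hypothetical
# entries; short codes keep the full taxonomy cheap to include per call).
ERROR_TAXONOMY = {
    "SYN-1":  "syntax: malformed clause",
    "SCH-1":  "schema linking: wrong table",
    "SCH-2":  "schema linking: wrong column",
    "JOIN-1": "join: wrong join type (INNER vs LEFT)",
    "JOIN-2": "join: missing or incorrect join condition",
    "FIL-1":  "filter: wrong predicate or operator",
    "AGG-1":  "aggregation: missing GROUP BY / HAVING",
    "VAL-1":  "value: wrong literal format or casing",
    "SUB-1":  "subquery: incorrect nesting or correlation",
    "SET-1":  "set operation: UNION/INTERSECT misuse",
    "STR-1":  "structure: missing clause (ORDER BY, LIMIT)",
}

def correction_prompt(sql, feedback):
    """Build a Correction Plan prompt: diagnose first, plan second,
    and defer SQL generation to the Correction SQL agent."""
    codes = "\n".join(f"{c}: {d}" for c, d in ERROR_TAXONOMY.items())
    return (
        f"Failing SQL:\n{sql}\n\nFeedback:\n{feedback}\n\n"
        f"Error taxonomy:\n{codes}\n\n"
        "First name the matching error code(s), then write a step-by-step "
        "repair plan. Do not write SQL yet."
    )
```

Forcing the model to name an error code before planning is what turns "blind retry" into targeted repair.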

Key Experimental Results

Spider Benchmark

| Method | Spider EA | Spider-Realistic EA |
|---|---|---|
| DIN-SQL + GPT-4 | 82.8% | 78.1% |
| DAIL-SQL + GPT-4 + SC | 83.6% | 75.2% |
| MAC-SQL + GPT-4 | 86.8% | - |
| Tool-SQL + GPT-4 | 86.9% | 82.9% |
| Chase SQL | 87.6% | - |
| SQL-of-Thought + Claude 3 Opus | 91.59% | 90.16% |

Ablation Study (100 samples)

| Configuration | Accuracy | Note |
|---|---|---|
| SQL-of-Thought (full) | 95% | Complete framework |
| w/o correction loop | 85% | −10%; correction is critical |
| w/o Query Plan | 90% | −5%; CoT planning is beneficial |

Model Comparison

| Model | SQL-of-Thought EA |
|---|---|
| Claude 3 Opus | 95% |
| GPT-5 | 89% |
| GPT-4o-mini | 87% |
| GPT-3.5 | 67% |
| Llama-3.1-8B | ~45% |

Key Findings

  • Correction loop contributes +10%: SQL that is syntactically correct but logically erroneous requires structured diagnosis to repair.
  • CoT Query Plan contributes +5%: Planning before generating SQL is more reliable than direct generation.
  • Claude 3 Opus performs best: It demonstrates the strongest reasoning capability and SQL generation accuracy among all evaluated models.
  • Cost trade-off: A single Spider run costs ~$42; a hybrid model strategy can reduce this to ~$30 (at 85% EA).

Highlights & Insights

  • Value of the error taxonomy: The structured 31-category taxonomy transforms LLM behavior from "blind retry" to "targeted repair" — removing the taxonomy exacerbates repeated corrections of identical errors.
  • Two-step correction > direct correction: Generating a correction plan before generating SQL is more effective than directly feeding error information to the SQL Agent — LLMs benefit from structured intermediate reasoning steps.
  • Lessons from failed ablations: Assigning multiple repair agents to each fix a different error type and then merging results leads to conflicts; carrying correction history causes context bloat and performance degradation.
  • Large gap for open-source models: Llama-3.1-8B achieves only ~45%, exposing the limitations of smaller models on complex structured generation tasks.

Limitations & Future Work

  • Evaluation limited to the Spider series: Spider does not reflect the complexity of real-world databases (the TAG framework indicates it covers only ~20% of real queries).
  • High API cost: The multi-agent framework costs ~$42 per run, limiting practical deployment.
  • Error taxonomy coverage unverified: Coverage across diverse query structures remains unknown.
  • Future directions: (1) Evaluation on BIRD-SQL and real-world databases; (2) fine-tuning smaller models to replace API calls and reduce cost; (3) automatic learning and updating of the error taxonomy.

Comparison with Prior Work

  • vs. DIN-SQL: DIN-SQL's correction merely regenerates the prompt without specific error information; SQL-of-Thought provides taxonomy-guided, targeted correction.
  • vs. Chase SQL: Chase SQL employs a multi-candidate selection strategy, whereas SQL-of-Thought uses a single-path iterative correction approach — making it more efficient.
  • vs. Think2SQL: Think2SQL finds that reasoning offers mixed benefits for SQL generation; SQL-of-Thought eliminates the "unguided reasoning can be harmful" problem through staged reasoning (plan → SQL) and taxonomy-guided correction.
  • Insight: For structured output generation (SQL/code), a "diagnose → plan → repair" paradigm is more effective than "detect → retry."

Rating

  • Novelty: ⭐⭐⭐⭐ Taxonomy-guided correction is a novel and practical design.
  • Experimental Thoroughness: ⭐⭐⭐ Limited to the Spider series; lacks evaluation on BIRD-SQL and real-world databases.
  • Writing Quality: ⭐⭐⭐⭐ Architecture is clearly presented; ablation analysis is thorough, including lessons from failed ablations.
  • Value: ⭐⭐⭐⭐ Spider SOTA results are convincing, but generalization to harder benchmarks requires further validation.