AgentiQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation¶
Conference: NeurIPS 2025 arXiv: 2510.10661 Code: To be released Area: Interpretability Keywords: Text-to-SQL, multi-expert, question decomposition, adaptive routing, Spider benchmark
TL;DR¶
This paper proposes AgentiQL, a multi-expert agent framework for Text-to-SQL: a reasoning agent decomposes questions into sub-problems, a coding agent generates sub-queries, a refinement step corrects column selection, and an adaptive router chooses between a baseline parser and the modular pipeline for each query. Using a 14B open-source model, AgentiQL achieves 86.07% EX on Spider, approaching the GPT-4 SOTA (89.65%).
Background & Motivation¶
Background: LLMs have significantly advanced NL2SQL capabilities, yet monolithic LLM architectures still struggle with complex reasoning and diverse schema handling. Static ensemble methods incur high computational overhead, and deploying large-scale LLMs remains costly in practice.
Limitations of Prior Work: (1) Monolithic LLMs perform poorly on complex multi-table joins and nested aggregations; (2) existing systems lack interpretability—users cannot trace how SQL is generated; (3) serving large models (e.g., GPT-4) is costly and impractical for real-world deployment.
Key Challenge: Large models deliver strong performance but at high cost, while small models are economical but insufficient for complex queries.
Goal: Achieve large-model-level NL2SQL performance using a small model (14B parameters) through multi-expert collaboration, while maintaining interpretability and efficiency.
Key Insight: Decompose SQL generation into three specialized components—reasoning (question decomposition), coding (sub-query generation), and refinement (column selection correction)—and use an adaptive router to select the execution path based on query complexity.
Core Idea: Apply Divide-and-Merge to decompose complex SQL generation into sub-problem–sub-query pairs, combined with Column Selection refinement and Adaptive Routing, enabling a 14B open-source model to approach GPT-4-level performance.
Method¶
Overall Architecture¶
Query input → Adaptive Router assesses complexity → simple queries are sent to the baseline for direct generation, while complex queries go through the AgentiQL pipeline (Table Selection → Question Decomposition → Sub-Query Generation → Merge → Column Selection).
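The dispatch step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `route`, `generate_sql`, and the threshold value are assumptions; the only grounded detail is that schema size (number of tables) serves as a complexity proxy for routing.

```python
from typing import Callable

def route(num_tables: int, threshold: int = 2) -> str:
    """Schema size as a complexity proxy (the paper also considers an
    XGBoost classifier or a reasoning LLM acting as judge)."""
    return "agentiql" if num_tables > threshold else "baseline"

def generate_sql(question: str, schema: dict,
                 baseline: Callable, agentiql: Callable) -> str:
    """Dispatch a query to the baseline parser or the full pipeline."""
    if route(len(schema)) == "agentiql":
        return agentiql(question, schema)
    return baseline(question, schema)
```

In practice the two callables would wrap the baseline parser and the modular pipeline; here they are left abstract so the routing logic stands alone.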
Key Designs¶
- Divide:
- Function: Decompose complex natural language queries into manageable sub-problems.
- Mechanism: Three steps—(1) Table Selection: a reasoning LLM filters irrelevant tables to produce a reduced schema \(\tilde{s}\); (2) Question Decomposition: a reasoning LLM decomposes the query into sub-problems \(\{x_1, ..., x_k\}\); (3) Query Generation: a coding LLM generates a SQL sub-query for each sub-problem, with up to \(R=3\) retries on failure.
- Design Motivation: Complex queries require multi-step reasoning; decomposition simplifies each step and exposes intermediate reasoning, improving interpretability.
- Merge:
- Function: Combine sub-queries into a final complete SQL statement.
- Mechanism: Two strategies—(1) Last Sub-query: takes the final sub-query as the complete answer (assuming the reasoning agent orders sub-problems such that the last corresponds to the full solution); (2) Planner & Executor: a reasoning LLM acts as planner to decide how to merge, while a coding LLM acts as executor to implement the merge. The latter is more general but computationally costlier.
- Design Motivation: Last Sub-query is effective in many cases where sub-problems are naturally progressive, while Planner & Executor handles complex scenarios requiring genuine merging of multiple sub-queries.
- Column Selection:
- Function: Ensure that the columns and their ordering in the SELECT clause match user intent.
- Mechanism: A reasoning LLM inspects the SELECT clause of the merged SQL and adjusts column selection and ordering to precisely match query requirements.
- Design Motivation: The merge process may introduce extraneous columns or incorrect ordering; this refinement step corrects such issues and consistently improves EX (by roughly 2–9 points in the ablations).
- Adaptive Routing:
- Function: Route queries to either baseline direct generation or the AgentiQL pipeline based on query complexity.
- Mechanism: Either an XGBoost classifier or a reasoning LLM acting as judge estimates query complexity from simple signals, such as schema size (number of tables), used as a complexity proxy.
- Design Motivation: Simple queries are handled more efficiently and accurately by the baseline; the full pipeline is activated only for complex queries, balancing efficiency and accuracy.
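The four designs above chain together roughly as follows. This is a hedged sketch with every LLM call stubbed as a plain callable: `agentiql_pipeline` and the task tags are illustrative names, and only the structure (Table Selection → Decomposition → Sub-Query Generation with up to R=3 retries → Merge → Column Selection) comes from the paper.

```python
R = 3  # retry budget per sub-query, as reported in the paper

def agentiql_pipeline(question, schema, reason_llm, code_llm, executes):
    """Divide-and-Merge flow with stubbed LLMs (illustrative names)."""
    # (1) Table Selection: reasoning LLM prunes irrelevant tables.
    reduced_schema = reason_llm("select_tables", question, schema)
    # (2) Question Decomposition: ordered sub-problems x_1..x_k.
    sub_problems = reason_llm("decompose", question, reduced_schema)
    # (3) Sub-query generation, up to R retries on execution failure.
    sub_queries = []
    for x in sub_problems:
        for _ in range(R):
            sql = code_llm("to_sql", x, reduced_schema, sub_queries)
            if executes(sql):
                break
        sub_queries.append(sql)
    # Merge (Planner & Executor variant): reasoner plans, coder merges.
    plan = reason_llm("plan_merge", question, sub_queries)
    merged = code_llm("merge", plan, sub_queries)
    # Column Selection: realign the SELECT clause with user intent.
    return reason_llm("fix_columns", question, merged)
```

The Last Sub-query merge variant would simply return `sub_queries[-1]` (after Column Selection) instead of invoking the planner.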
Loss & Training¶
No training is required (inference-time method). The Qwen2.5-Coder series (7B/14B/32B) is used as the coding LLM with few-shot prompting.
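The few-shot setup might look like the sketch below. The prompt wording, the `FEW_SHOT` exemplar, and `build_prompt` are all assumptions for illustration; these notes do not reproduce the paper's actual prompts.

```python
# Hypothetical few-shot exemplars for the coding LLM; the (question, SQL)
# pair here is illustrative, not taken from the paper's prompt set.
FEW_SHOT = [
    ("How many singers are there?", "SELECT COUNT(*) FROM singer;"),
]

def build_prompt(schema_ddl: str, sub_problem: str) -> str:
    """Assemble schema context, exemplars, and the target sub-problem."""
    shots = "\n\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in FEW_SHOT)
    return (f"Schema:\n{schema_ddl}\n\n{shots}\n\n"
            f"Q: {sub_problem}\nSQL:")
```

Because the method is inference-time only, improvements come entirely from prompt structure and the pipeline around the frozen Qwen2.5-Coder models.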
Key Experimental Results¶
Main Results (Spider Benchmark)¶
| Method | 7B EX | 14B EX | 32B EX |
|---|---|---|---|
| Baseline (Qwen2.5-Coder) | 80.69 | 82.10 | 84.57 |
| Last Sub-query + CS | 74.44 | 80.21 | 83.16 |
| Planner & Executor + CS | 75.85 | 83.40 | 86.07 |
| GPT-4 SOTA (CHASE-SQL) | - | - | 89.65 |
Ablation Study¶
| Configuration | 7B EX | 14B EX |
|---|---|---|
| Last Sub-query w/o CS | 72.26 | 78.38 |
| Last Sub-query + CS | 74.44 (+2.18) | 80.21 (+1.83) |
| Planner & Executor w/o CS | 66.77 | 79.39 |
| Planner & Executor + CS | 75.85 (+9.08) | 83.40 (+4.01) |
Key Findings¶
- Column Selection contributes consistently: improvements of 2–9% EX across all configurations, and is especially critical for the Planner & Executor strategy (correcting column errors introduced by the planner).
- 14B + Planner & Executor + CS reaches 86.07%: narrowing the gap to the GPT-4 SOTA (89.65%) to about 3.6 points using an open-source small model.
- Routing improves robustness: the baseline performs better on simple queries while AgentiQL excels on complex ones; routing combines the strengths of both.
- Interpretability is enhanced: intermediate sub-problems and sub-queries expose the reasoning process.
Highlights & Insights¶
- The Column Selection refinement step, though seemingly simple, yields substantial gains (up to +9%), suggesting that the last mile of NL2SQL lies in SELECT clause alignment—an aspect that is frequently overlooked.
- The Adaptive Routing design is highly practical: not all queries require a complex pipeline; allocating compute on demand is an effective strategy.
- Extensible to parallel execution: sub-query generation can be parallelized to improve throughput.
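The parallel-execution point can be sketched as below. Note the caveat: in the paper's pipeline, later sub-queries may condition on earlier ones, so this sketch assumes independent sub-problems; `generate_sub_queries` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_sub_queries(sub_problems, code_llm, max_workers=4):
    """Dispatch independent sub-problems to the coding LLM concurrently.

    map() preserves the reasoning agent's sub-problem ordering, which the
    Last Sub-query merge strategy relies on.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(code_llm, sub_problems))
```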
Limitations & Future Work¶
- Evaluation limited to Spider: validation on additional benchmarks such as BIRD and SQL-Eval is needed.
- Large model overhead remains prohibitive: a 235B model requires approximately 60 minutes per query on 4 A100 GPUs.
- Decomposition failure is the primary failure mode: if the reasoning agent decomposes a question incorrectly, all subsequent steps fail.
- Suggested direction: Incorporate RL to optimize query generation quality (e.g., SkyRL-SQL).
Related Work & Insights¶
- vs. DIN-SQL: DIN-SQL also performs question decomposition, but AgentiQL additionally introduces Column Selection and Adaptive Routing.
- vs. CHASE-SQL: CHASE-SQL achieves SOTA using GPT-4; AgentiQL approaches its performance with a 14B open-source model.
- vs. CodeS: CodeS employs fine-tuning, whereas AgentiQL is a pure inference-time method requiring no training.
Rating¶
- Novelty: ⭐⭐⭐ The multi-expert decomposition idea is not novel per se, but the Column Selection + Routing combination is effective
- Experimental Thoroughness: ⭐⭐⭐ Limited to Spider, but ablations are thorough
- Writing Quality: ⭐⭐⭐⭐ Clear and well-structured
- Value: ⭐⭐⭐⭐ A practical solution for approaching GPT-4 SOTA with small open-source models