SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL¶
Conference: ACL 2025
arXiv: 2506.00391
Code: GitHub
Area: Others
TL;DR¶
SHARE chains three dedicated Small Language Models (SLMs, each with <8B parameters) into a sequential pipeline. It translates declarative SQL into step-by-step action trajectories that expose the reasoning path, then corrects schema linking errors and logical reasoning errors in separate stages, giving LLM-based Text-to-SQL self-correction at very low cost.
Background & Motivation¶
- High Cost of LLM Self-Correction: Existing self-correction methods rely on recursive calls to LLMs, requiring full inference in each round, which leads to multiplicative computational overhead; for instance, the Multiple-Prompt method costs up to $62.86 per thousand queries.
- Unreliable and Restricted Execution Feedback: Self-debugging relies on database execution feedback, but mainstream engines like SQLite return overly simplified information, making it difficult to pinpoint semantic errors precisely; additionally, in data privacy-sensitive scenarios, direct database access is strictly restricted.
- Black-box Dilemma of Declarative SQL: When LLMs directly correct declarative SQL, the underlying reasoning process stays hidden, which leads to a self-enhancement bias: GPT-4o's native self-correction actually lowers BIRD accuracy from 55.87% to 55.28%.
- Heavy Reliance on Prompt Engineering: Existing high-performance methods (such as MAGIC) rely on elaborately designed multi-stage prompts, which incur high labor costs and are difficult to generalize to other generator models or SQL dialects.
Method¶
Overall Architecture¶
SHARE adopts an assistant-based self-correction paradigm: after an arbitrary generator LLM generates an initial SQL, three LoRA-finetuned dedicated SLMs (all <8B parameters) form a sequential pipeline for correction. Finally, the corrected action trajectory is used as feedback to guide the LLM to regenerate the SQL.
Key Designs¶
1. Base Action Model (BAM): Exposing Reasoning from Declarative to Procedural
BAM decomposes declarative SQL queries into step-by-step action trajectories similar to pandas APIs (e.g., where → groupby → orderby → select), making the implicit reasoning paths explicit. The training data consists of 13K high-quality query-trajectory pairs distilled via GPT-4o, and reversibility verification (trajectories must be restorable back to the original SQL) is applied to filter out hallucinated samples. BAM also acts as a data factory for subsequent models, automatically synthesizing training data for SAM and LOM through a hierarchical self-evolution strategy, avoiding repeated calls to the teacher LLM.
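The reversibility check can be illustrated with a small sketch. It assumes a hypothetical `Action` record and a deterministic renderer, `trajectory_to_sql`, that only covers single-table queries; the paper does not specify how the reconstruction is done, so this is one possible reading rather than the authors' procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One step of a pandas-like trajectory (hypothetical representation)."""
    op: str   # "where", "groupby", "orderby" or "select"
    arg: str  # SQL fragment the step applies


def trajectory_to_sql(table: str, actions: list[Action]) -> str:
    """Render a trajectory back into a single-table SELECT statement."""
    parts = {a.op: a.arg for a in actions}
    sql = f"SELECT {parts.get('select', '*')} FROM {table}"
    if "where" in parts:
        sql += f" WHERE {parts['where']}"
    if "groupby" in parts:
        sql += f" GROUP BY {parts['groupby']}"
    if "orderby" in parts:
        sql += f" ORDER BY {parts['orderby']}"
    return sql


def normalize(sql: str) -> str:
    """Ignore case, trailing semicolons and whitespace differences."""
    return " ".join(sql.strip().rstrip(";").split()).lower()


def is_reversible(original_sql: str, table: str, actions: list[Action]) -> bool:
    """Keep a training pair only if the trajectory restores the original SQL."""
    return normalize(trajectory_to_sql(table, actions)) == normalize(original_sql)


original = ("SELECT city, COUNT(*) FROM customers WHERE age > 30 "
            "GROUP BY city ORDER BY COUNT(*) DESC;")
faithful = [
    Action("where", "age > 30"),
    Action("groupby", "city"),
    Action("orderby", "COUNT(*) DESC"),
    Action("select", "city, COUNT(*)"),
]
# A hallucinated trajectory: the filter no longer matches the query.
hallucinated = [Action("where", "age > 40")] + faithful[1:]
```

Here `is_reversible(original, "customers", faithful)` holds while the hallucinated variant fails, so only the first pair would survive filtering.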
2. Schema Augmentation Model (SAM): Two-stage Schema Linking Correction
SAM focuses on detecting and correcting schema linking errors (mismatched table names, column names, etc.) in action trajectories. It adopts a two-stage training approach: Phase 1 learns to precisely mark all schema element positions in the trajectory using [MASK]; Phase 2 fills the masked positions with correct schema links based on the database schema, user questions, and the schema list of the initial SQL. This decoupled "locate-then-correct" design allows the model to optimize each subtask independently.
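The "locate-then-correct" split can be sketched as two plain functions. In SHARE both phases are learned by the SLM; in this illustration the locating step is a regex over a known schema list and the `fills` argument stands in for the model's predictions, so `mask_schema` and `fill_masks` are hypothetical helpers, not the paper's interface.

```python
import re

MASK = "[MASK]"


def mask_schema(trajectory: list[str], schema_terms: set[str]) -> list[str]:
    """Phase 1 (sketch): replace every listed schema identifier with [MASK]."""
    if not schema_terms:
        return list(trajectory)
    # Longest identifiers first so a longer name is never split by a shorter one.
    ordered = sorted(schema_terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return [pattern.sub(MASK, step) for step in trajectory]


def fill_masks(masked: list[str], fills: list[str]) -> list[str]:
    """Phase 2 (sketch): fill masks left to right with corrected identifiers."""
    n_masks = sum(step.count(MASK) for step in masked)
    if n_masks != len(fills):
        raise ValueError(f"expected {n_masks} fills, got {len(fills)}")
    it = iter(fills)
    return [re.sub(re.escape(MASK), lambda _m: next(it), step) for step in masked]


# Trajectory produced from an initial SQL that used a non-existent column "cust".
trajectory = ["where(orders.amount > 100)", "groupby(orders.cust)"]
masked = mask_schema(trajectory, {"orders", "amount", "cust"})
# Stand-in for the model's Phase 2 output, given the real schema and the question.
fixed = fill_masks(masked, ["orders", "amount", "orders", "customer_id"])
```

Separating the two steps means the fill step never has to decide where a schema element is, only which identifier belongs there.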
3. Logic Optimization Model (LOM): Logic Correction with Action Perturbation Augmentation
LOM addresses logical reasoning errors (order of operations, conditional logic, aggregation methods, etc.). To expand the training data, an action perturbation strategy is proposed: three types of perturbations, ADD (inserting redundant actions), DELETE (deleting necessary actions), and SUBSTITUTE (replacing action types or parameters), are applied to correct trajectories to simulate real error patterns. Ultimately, 15K error-correct trajectory pairs are collected for training.
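A minimal sketch of the three perturbation types follows. The tuple representation of an action, the `ACTION_TYPES` vocabulary, and the `<redundant>` placeholder argument are assumptions made for illustration; the paper's actual sampling rules are not given here.

```python
import random

# Hypothetical action vocabulary; the real action space is defined by the paper.
ACTION_TYPES = ["where", "groupby", "orderby", "select", "limit"]

Action = tuple[str, str]  # (action type, argument)


def perturb(trajectory: list[Action], kind: str, rng: random.Random) -> list[Action]:
    """Return a corrupted copy of a correct trajectory."""
    steps = list(trajectory)
    if kind != "ADD" and not steps:
        raise ValueError("DELETE and SUBSTITUTE need a non-empty trajectory")
    if kind == "ADD":
        # Insert a redundant action at a random position.
        steps.insert(rng.randrange(len(steps) + 1),
                     (rng.choice(ACTION_TYPES), "<redundant>"))
    elif kind == "DELETE":
        # Drop a necessary action.
        del steps[rng.randrange(len(steps))]
    elif kind == "SUBSTITUTE":
        # Swap the action type of one step for a different one.
        i = rng.randrange(len(steps))
        op, arg = steps[i]
        steps[i] = (rng.choice([t for t in ACTION_TYPES if t != op]), arg)
    else:
        raise ValueError(f"unknown perturbation: {kind}")
    return steps


correct = [("where", "age > 30"), ("groupby", "city"), ("select", "city, COUNT(*)")]
rng = random.Random(0)
# Each pair is (erroneous trajectory, correct trajectory), the shape LOM trains on.
pairs = [(perturb(correct, kind, rng), correct)
         for kind in ("ADD", "DELETE", "SUBSTITUTE")]
```

Because the correct trajectory is known by construction, every perturbed sample comes with its own gold correction at no extra labeling cost.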
Training & Inference¶
| Stage | Model | Data Source | Data Volume | Training Method |
|---|---|---|---|---|
| Stage 1 | BAM | GPT-4o Distillation + Reversibility Verification | 13K | LoRA Fine-tuning |
| Stage 2 | SAM | BAM Hierarchical Self-Evolution + Schema Masking | 13K | LoRA Two-stage Fine-tuning |
| Stage 3 | LOM | BAM Self-Evolution + Action Perturbation Augmentation | 15K | LoRA Fine-tuning |
During inference, the three models execute sequentially: BAM generates action trajectories → SAM corrects schema → LOM corrects logic → the corrected trajectory is fed back to the LLM to regenerate SQL, so the whole correction takes a single round of interaction.
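The inference order can be written down as a short orchestration sketch. The function name `share_correct`, the callable signatures, and the stub models below are all assumptions for illustration; in practice each callable would wrap a LoRA-finetuned SLM or the generator LLM.

```python
from typing import Callable

Trajectory = list[str]


def share_correct(
    question: str,
    schema: str,
    initial_sql: str,
    bam: Callable[[str], Trajectory],
    sam: Callable[[Trajectory, str, str], Trajectory],
    lom: Callable[[Trajectory, str, str], Trajectory],
    regenerate: Callable[[str, str, str, Trajectory], str],
) -> str:
    """Single-round correction: BAM, then SAM, then LOM, then one LLM call."""
    trajectory = bam(initial_sql)                     # SQL to action trajectory
    trajectory = sam(trajectory, question, schema)    # fix schema linking
    trajectory = lom(trajectory, question, schema)    # fix logic
    return regenerate(question, schema, initial_sql, trajectory)


# Stub models that only record the call order and patch one column name.
calls: list[str] = []


def bam_stub(sql: str) -> Trajectory:
    calls.append("BAM")
    return ["where(age > 30)", "select(nme)"]


def sam_stub(traj: Trajectory, question: str, schema: str) -> Trajectory:
    calls.append("SAM")
    return [step.replace("nme", "name") for step in traj]


def lom_stub(traj: Trajectory, question: str, schema: str) -> Trajectory:
    calls.append("LOM")
    return traj


def llm_stub(question: str, schema: str, sql: str, traj: Trajectory) -> str:
    calls.append("LLM")
    return "SELECT name FROM users WHERE age > 30"


result = share_correct(
    "Names of users older than 30?", "users(name, age)",
    "SELECT nme FROM users WHERE age > 30",
    bam_stub, sam_stub, lom_stub, llm_stub,
)
```

The generator LLM is invoked exactly once after the three SLM passes, which is where the cost advantage over recursive LLM self-correction comes from.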
Key Experimental Results¶
Main Results: GPT-4o Generator + Single-round Correction¶
| Method | External Feedback | BIRD EX(%) | SPIDER EX(%) | Cost per 1K Queries |
|---|---|---|---|---|
| GPT-4o (baseline) | ā | 55.87 | 77.10 | ā |
| Self-Correction | ā | 55.28 ↓ | 75.90 | ā |
| Self-Consistency | ā | 58.75 | 81.80 | ā |
| Multiple-Prompt | ā | 58.80 | 81.50 | $62.86 |
| Self-Debugging | ā | 58.28 | 81.20 | ā |
| MAC-Refiner | ā | 58.74 | 80.40 | $20.18 |
| MAGIC | ā | 59.53 | 85.66 | $37.99 |
| +SHARE-3.8B | ā | 60.89 | 84.00 | ā |
| +SHARE-8B | ā | 64.14 | 85.90 | $2.57 |
SHARE-8B achieves a 14.80% relative improvement on BIRD (55.87 → 64.14), with an inference cost of $2.57 per 1K queries, roughly one-eighth of the cheapest baseline with a reported cost (MAC-Refiner, $20.18).
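The headline figures can be rechecked from the table values alone. The helper name `relative_gain` is introduced here for illustration; the inputs are the EX scores and per-1K-query costs reported above.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100


bird_gain = relative_gain(55.87, 64.14)    # GPT-4o vs. +SHARE-8B on BIRD
spider_gain = relative_gain(77.10, 85.90)  # GPT-4o vs. +SHARE-8B on SPIDER

share_cost = 2.57                          # USD per 1K queries
cost_vs_mac_refiner = share_cost / 20.18   # cheapest baseline with a reported cost
cost_vs_magic = share_cost / 37.99

print(f"BIRD:   +{bird_gain:.2f}% relative")
print(f"SPIDER: +{spider_gain:.2f}% relative")
print(f"Cost vs MAC-Refiner: {cost_vs_mac_refiner:.1%}")
print(f"Cost vs MAGIC:       {cost_vs_magic:.1%}")
```

The gains are relative to the GPT-4o baseline, not absolute point differences (8.27 points on BIRD, 8.80 on SPIDER).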
Cross-Model Generalization & Ablation Study¶
| Experimental Dimension | Setting | EX (%) | Gain |
|---|---|---|---|
| Cross-Model (BIRD) | Claude-3.5-S → +SHARE-8B | 49.41 → 63.56 | +28.64% (relative) |
| | GPT-4o-mini → +SHARE-8B | 49.09 → 59.64 | +21.49% (relative) |
| | Llama-3.1-70B → +SHARE-8B | 53.91 → 61.93 | +14.88% (relative) |
| | DS-Coder-6.7B → +SHARE-8B | 34.57 → 51.24 | +48.22% (relative) |
| Robustness | DK Dataset +SHARE-8B | 64.10 → 75.30 | +11.20 pts |
| | Realistic Dataset +SHARE-8B | 73.40 → 81.50 | +8.10 pts |
| Ablation (BIRD) | w/o SAM (Schema Aug) | 60.02 | ↓ 4.08 |
| | w/o LOM (Logic Opt) | 56.98 | ↓ 7.16 |
| | w/o Hierarchical Self-Evolution | 60.55 | ↓ 3.59 |
| | w/o Action Perturbation | 61.38 | ↓ 2.76 |
| Data Efficiency (BIRD) | 50% Training Data | 60.71 | Outperforms MAGIC |
| Open-source Teacher (BIRD) | SHARE-llama (Llama-70B Teacher) | 65.19 | Outperforms SHARE-gpt |
Key Findings¶
- GPT-4o native self-correction leads to performance degradation (55.87 → 55.28), validating LLM's self-enhancement bias on declarative SQL.
- SHARE achieves BIRD +14.80% and SPIDER +11.41% in a single round, with an inference cost of only $2.57/1K queries.
- Utilizing only 50% of the training data outperforms SOTA (MAGIC), demonstrating that the hierarchical self-evolution strategy substantially boosts data efficiency.
- It generalizes effectively across different models (closed-source/open-source) and SQL dialects (MySQL/PostgreSQL), rather than overfitting to specific error patterns.
- Using the open-source Llama-70B as a teacher model instead of GPT-4o, SHARE-llama yields performance comparable to or even better than SHARE-gpt.
Highlights & Limitations¶
Highlights:
- The core innovation lies in the paradigm shift from "declarative to procedural", converting black-box SQL correction into white-box action trajectory debugging, which dramatically enhances error localization accuracy.
- The collaborative paradigm of SLM assisting LLM is exceptionally cost-effective, with inference costs at only 6.8% of MAGIC.
- The hierarchical self-evolution strategy decouples training data construction from reliance on teacher LLMs, reducing training phase costs to only 14.7% of MAGIC.
Limitations:
- Only single-round correction was validated; multi-round iterative correction scenarios remain unexplored.
- The predefined pandas-like action space might not cover all the complex features of various SQL dialects.
- The correction effect on mathematical reasoning errors (Mathematical Delusion) is limited (a change of only 1.63%), constrained by the mathematical capabilities of the generator model.
Rating¶
| Dimension | Score (1-10) | Description |
|---|---|---|
| Novelty | 7 | The design of declarative-to-procedural translation + three-model sequential pipeline is innovative, but the action model and LoRA fine-tuning are mature techniques. |
| Effectiveness | 9 | Comprehensive improvements across four benchmarks, strong generalization across models/dialects, outperforming SOTA with only 50% data, and complete ablation studies. |
| Engineering Value | 8 | Extremely low inference cost ($2.57/1K queries), plug-and-play to assist any LLM, and the open-source teacher model option further lowers the barrier to entry. |
| Reproducibility | 8 | The code is open-source, training details are sufficient, and hyperparameters and prompts are fully provided in the appendix. |
- Limitation: currently evaluated only on SQLite-related benchmarks; effectiveness on other database systems remains unknown.
Related Work & Insights¶
- Self-debugging: Iterative correction based on execution feedback (Zhong et al., 2023; Li & Xie, 2024)
- Self-correction: Autonomous correction without execution feedback (Liu & Tan, 2024; Askari et al., 2024)
- Action model: Decomposing tasks into procedural action trajectories (Zhang et al., 2024)
- MAGIC: Current SOTA self-correction method for text-to-SQL
Rating¶
- Novelty: The design of action trajectory transformation coupled with modular correction is highly creative.
- Technical Depth: The three-stage pipeline and self-evolution training strategy are meticulously designed.
- Experimental Thoroughness: 4 benchmarks + multiple generators + resource-constrained analysis + cross-dialect testing.
- Value: Low-cost auxiliary correction is highly valuable in privacy-restricted practical scenarios.