PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents¶
Conference: ACL 2026 arXiv: 2604.17653 Code: GitHub Area: Text-to-SQL / Agent Keywords: Text-to-SQL, database probing, rule-based verification, semantic constraints, agent framework
TL;DR¶
This paper proposes PV-SQL, an agent-based Text-to-SQL framework built from two complementary components: Probe, which iteratively generates probing queries to discover database value formats, column semantics, and table relationships, and Verify, which extracts verifiable constraints via pattern matching and constructs a checklist. On the BIRD benchmark, PV-SQL achieves roughly 5 points higher execution accuracy and 20.8 points higher valid efficiency score than the best baseline.
Background & Motivation¶
Background: Text-to-SQL has made significant progress with LLMs, yet persistent challenges remain — schema understanding, value anchoring (mapping natural language to exact database values), and constraint satisfaction (ensuring SQL faithfully captures all semantics).
Limitations of Prior Work: (1) Approximately 41% of failures stem from database misunderstanding — models do not know whether "California" is stored as "CA" or its full name; (2) even when understanding is correct, SQL generation lacks a verification mechanism, potentially producing syntactically valid but semantically incorrect queries; (3) existing verification methods (LLM self-verification / test case generation) are either unreliable or computationally expensive.
Key Challenge: Schema descriptions (DDL) alone do not contain actual data values, yet including all values in the prompt is infeasible. What is needed is on-demand, question-driven exploration of database contents.
Goal: Resolve comprehension errors through adaptive database probing, and resolve synthesis errors through deterministic rule-based verification.
Key Insight: Probe enhances the input (enriching context with real database evidence), while Verify enhances the output (ensuring semantic constraints are satisfied) — the two components address complementary failure types.
Core Idea: The SQL agent first "examines what the data looks like" before writing a query — mirroring the workflow of a human data analyst — and then reviews constraints one by one like a code reviewer.
Method¶
Overall Architecture¶
The framework operates in two stages: (1) Probe stage — the agent iteratively generates temporary SQL queries to explore the database (up to 5 rounds), discovering value formats and column semantics and accumulating findings into context \(G\); (2) Verify & Repair stage — 10 categories of verifiable constraints (e.g., DISTINCT / TOP-K / COUNT) are extracted from the question via pattern matching, a checklist is constructed, and after SQL generation the checklist is checked; unsatisfied constraints trigger iterative repair (up to 5 rounds).
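The two-stage control flow above can be sketched as a short driver loop. This is a minimal illustration, not the paper's implementation: the `probe_fn` / `generate_fn` / `repair_fn` callables stand in for prompted LLM calls, and `extract_constraints` / `check` stand in for the rule-based verifier described below.

```python
import sqlite3

MAX_PROBE_ROUNDS = 5   # probing budget used in the paper
MAX_REPAIR_ROUNDS = 5  # repair budget used in the paper

def pv_sql(question, conn, probe_fn, generate_fn, repair_fn,
           extract_constraints, check):
    """Sketch of the Probe -> Verify & Repair pipeline.

    The *_fn callables are placeholders for prompted LLM calls;
    extract_constraints/check are deterministic (no LLM involved).
    """
    context = []  # accumulated probing findings G

    # Stage 1: Probe -- iteratively explore the database.
    for _ in range(MAX_PROBE_ROUNDS):
        probe_sql = probe_fn(question, context)  # None => no more probing needed
        if probe_sql is None:
            break
        rows = conn.execute(probe_sql).fetchall()
        context.append((probe_sql, rows))  # findings accumulate into G

    # Stage 2: Verify & Repair -- checklist-guided generation.
    checklist = extract_constraints(question)  # rule-based checklist
    sql = generate_fn(question, context)
    for _ in range(MAX_REPAIR_ROUNDS):
        errors = check(conn, sql, checklist)  # syntax -> execution -> constraints
        if not errors:
            break
        sql = repair_fn(sql, errors)  # descriptive errors guide the repair
    return sql
```

Note that both loops have hard round budgets, so the agent terminates even when probing or repair never converges.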
Key Designs¶
- Database Probing:
- Function: On-demand discovery of value formats and semantic information within the database.
- Mechanism: The agent decides in a loop whether additional probing is needed. If so, it generates SELECT queries with LIMIT clauses to retrieve relevant record samples. After execution, findings (e.g., "California is stored as CA"; "'late' means ship_date > required_date") are summarized and accumulated into context \(G\).
- Design Motivation: Unlike static context augmentation (similarity-based retrieval), probing is question-adaptive — different questions require exploration of different database aspects.
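A single probing step of the kind described above can be sketched as follows. The helper `probe_column` is hypothetical (the paper lets the LLM write free-form `SELECT ... LIMIT` queries); this fixed-shape version just shows how sampling distinct values reveals storage formats such as `"CA"` versus `"California"`.

```python
import sqlite3

def probe_column(conn, table, column, limit=5):
    """Sample distinct values from one column so the agent can see how
    values are actually stored. Table/column names are identifiers, so
    they are validated rather than passed as query parameters."""
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    query = f'SELECT DISTINCT "{column}" FROM "{table}" LIMIT {int(limit)}'
    return [row[0] for row in conn.execute(query).fetchall()]

# Toy example: a customers table where states use abbreviations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, state TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ann", "CA"), ("Bob", "NY"), ("Cy", "CA")])
print(probe_column(conn, "customers", "state"))  # e.g. ['CA', 'NY']
```

Seeing `'CA'` in the sample is exactly the evidence the agent summarizes into \(G\) ("California is stored as CA") before writing the final query.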
- Verify & Repair:
- Function: Ensure that the generated SQL satisfies semantic constraints implicit in the question.
- Mechanism: Pattern matching extracts 10 constraint categories from the question (DISTINCT → "unique"/"distinct"; TOP-K → "top/first N"; COUNT → "how many"; etc.), forming a deterministic checklist. The SQL then passes through a pipeline of syntax checking → execution checking → constraint checking; each violation generates a descriptive error message to guide repair.
- Design Motivation: Rule-based verification is reliable (deterministic pattern matching), lightweight (no LLM calls required), and interpretable (each constraint has a clear source). Precision is prioritized over recall — missing a constraint is acceptable, while false positives introduce unnecessary repairs.
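The checklist mechanism can be illustrated with a few regex rules. The patterns and the SQL-side checks below are my own illustrative approximations of three of the ten categories; the paper does not publish its exact rules.

```python
import re

# Illustrative regex rules for three of the ten constraint categories
# (the paper's exact patterns are not given).
RULES = [
    ("DISTINCT", re.compile(r"\b(unique|distinct|different)\b", re.I)),
    ("TOP-K",    re.compile(r"\b(top|first|highest|lowest)\s+\d+\b", re.I)),
    ("COUNT",    re.compile(r"\bhow many\b", re.I)),
]

def extract_constraints(question):
    """Build a deterministic checklist from the question text."""
    return [name for name, pat in RULES if pat.search(question)]

def check_constraints(sql, checklist):
    """Return one descriptive error message per unsatisfied constraint,
    using crude keyword checks on the SQL as a stand-in for real checks."""
    sql_up = sql.upper()
    needles = {"DISTINCT": "DISTINCT", "TOP-K": "LIMIT", "COUNT": "COUNT("}
    return [f"question implies {c} but SQL lacks {needles[c]}"
            for c in checklist if needles[c] not in sql_up]
```

For example, "How many distinct states are there?" yields the checklist `["DISTINCT", "COUNT"]`; `SELECT COUNT(DISTINCT state) FROM customers` passes, while `SELECT state FROM customers` produces two error messages to feed back into repair. Note how the precision-first philosophy shows up here: a pattern that never fires costs nothing, while a spurious match would trigger a needless repair round.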
- Complementarity of Probe and Verify:
- Function: Address comprehension errors (\(\varepsilon_D + \varepsilon_Q\)) and synthesis errors (\(\varepsilon_S\)) respectively.
- Mechanism: Probe uses real database evidence to target the 41.3% of failures caused by database misunderstanding and part of the 24.8% caused by question misunderstanding; Verify uses constraint checklists to target the 33.9% caused by synthesis errors.
- Design Motivation: Error-analysis-driven design — each component targets a distinct major failure mode.
Loss & Training¶
The framework is training-free and supports 6 base LLMs (GPT-4o/4.1, Claude 3.5/3.7, Gemini 2.0/2.5).
Key Experimental Results¶
Main Results¶
BIRD Benchmark Execution Accuracy
| Method | Execution Accuracy (%) | Valid Efficiency Score |
|---|---|---|
| Best Baseline (TS-SQL) | ~60 | ~66 |
| PV-SQL | 65.12 | 86.9 |
Ablation Study¶
| Configuration | Execution Accuracy | Notes |
|---|---|---|
| PV-SQL | 65.12 | Full model |
| w/o Probe | 60.8 | Remove probing, −4.3 pp |
| w/o Verify | 62.1 | Remove verification, −3.0 pp |
| w/o Both | 57.3 | Reverts to baseline |
Key Findings¶
- Probe and Verify contribute approximately 4.3 pp and 3.0 pp of improvement respectively, and their combined effect is close to the sum of the individual gains.
- PV-SQL consumes fewer tokens than TS-SQL — rule-based verification is more efficient than LLM-generated test cases.
- Probe yields the largest gains on "hard" difficulty questions, which most require value anchoring.
- Constraint extraction achieves precision > 90%, confirming the correctness of the precision-first strategy.
Highlights & Insights¶
- "Examine the data before writing the query" is a highly practical and intuitive strategy that emulates the workflow of a human data analyst.
- Rule-based verification as a form of "test cases" for SQL is a clever analogy — SQL naturally lacks test cases, yet the question itself encodes verifiable constraints.
- The pattern-matching rules for 10 constraint categories are simple yet effective, reflecting an engineering philosophy of preferring straightforward solutions.
Limitations & Future Work¶
- Rule-based verification covers only 10 constraint categories; complex semantic constraints still rely on LLM understanding.
- A maximum of 5 probing rounds may be insufficient for very complex databases.
- Evaluation is conducted exclusively on the BIRD benchmark family.
- Probing queries may expose sensitive data.
Related Work & Insights¶
- vs. TS-SQL: TS-SQL uses LLMs to generate test cases for verification; PV-SQL's rule-based verification is more reliable and lightweight.
- vs. DIN-SQL: DIN-SQL decomposes questions but does not probe the database; PV-SQL adds a database understanding dimension.
- vs. MAC-SQL: MAC-SQL employs a multi-agent framework but lacks an explicit verification mechanism.
Rating¶
- Novelty: ⭐⭐⭐⭐ The complementary Probe+Verify design is novel; the rule-based verification angle is practically motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 baselines + 6 LLMs + 3 benchmarks + detailed ablation + error analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem-driven narrative with clear motivation and vivid examples.
- Value: ⭐⭐⭐⭐⭐ Directly advances the practical deployment of Text-to-SQL systems.