PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents
Conference: ACL 2026
arXiv: 2604.17653
Code: GitHub
Area: Text-to-SQL / Agent
Keywords: Text-to-SQL, Database Probing, Rule-based Verification, Semantic Constraints, Agent Framework
TL;DR
This paper proposes PV-SQL, an agent-based Text-to-SQL framework built on two complementary components: Probe, which iteratively issues probing queries to discover database value formats, column semantics, and table relationships, and Verify, which extracts verifiable constraints from the question via pattern matching and compiles them into a checklist. On the BIRD benchmark, it achieves 5% higher Execution Accuracy and 20.8% higher Valid Efficiency Score than the best baseline.
Background & Motivation
Background: LLMs have driven significant progress in Text-to-SQL, but three challenges persist: schema understanding, value grounding (mapping natural language to precise database values), and constraint satisfaction (ensuring the SQL faithfully captures all semantics of the question).
Limitations of Prior Work: (1) Approximately 41% of failures stem from database misunderstandings; for example, the model does not know whether "California" is stored as "CA" or as the full name. (2) Even with correct understanding, SQL generation itself lacks a verification mechanism and can produce syntactically correct but semantically incorrect queries. (3) Existing verification methods (LLM self-correction, test case generation) are either unreliable or computationally expensive.
Key Challenge: Schema descriptions (DDL) alone do not include actual data values, yet it is infeasible to put all values into the prompt. What is needed is demand-driven, question-oriented exploration of database content.
Goal: To resolve understanding errors through adaptive database probing and to resolve synthesis errors through deterministic rule verification.
Key Insight: Probe enhances the input (enriching context with real database evidence), and Verify enhances the output (ensuring semantic constraints are satisfied)—the two address complementary failure types.
Core Idea: Let the SQL agent "see what the data looks like" before writing the query, as a data analyst would, and then check each constraint one by one, as a code reviewer would.
Method
Overall Architecture
The framework is divided into two stages: (1) Probe stage—the Agent iteratively generates temporary SQL queries to explore the database (up to 5 rounds), discovering value formats and column semantics which are accumulated in context \(G\); (2) Verify & Repair stage—10 types of verifiable constraints (e.g., DISTINCT/TOP-K/COUNT) are extracted from the question using pattern matching to build a checklist. After generating the SQL, the constraints are checked, and if not satisfied, the SQL is iteratively repaired (up to 5 rounds).
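The Probe stage described above can be sketched as a bounded loop. This is a minimal illustration rather than the paper's implementation: `probe`, `scripted_agent`, and the toy `customers` table are hypothetical names, the LLM agent is replaced by a scripted callable, and findings are stored as raw rows instead of natural-language summaries.

```python
import sqlite3
from typing import Callable, List, Optional

MAX_PROBE_ROUNDS = 5  # round limit stated in the text above


def probe(
    conn: sqlite3.Connection,
    question: str,
    propose_probe: Callable[[str, List[str]], Optional[str]],
    max_rounds: int = MAX_PROBE_ROUNDS,
) -> List[str]:
    """Run exploratory queries and accumulate findings (the context G)."""
    findings: List[str] = []
    for _ in range(max_rounds):
        # Hypothetical stand-in for the LLM agent's decision step.
        sql = propose_probe(question, findings)
        if sql is None:  # the agent decided that no further probing is needed
            break
        rows = conn.execute(sql).fetchall()
        # A real agent would summarize the rows in natural language;
        # this sketch simply records the query and its raw result.
        findings.append(f"{sql} -> {rows}")
    return findings


# Toy database in which states are stored as two-letter codes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ada", "CA"), ("Bob", "NY"), ("Cy", "CA")],
)


def scripted_agent(question: str, findings: List[str]) -> Optional[str]:
    """Hypothetical policy: probe the state column once, then stop."""
    if not findings:
        return "SELECT DISTINCT state FROM customers ORDER BY state LIMIT 5"
    return None


findings = probe(conn, "How many customers are in California?", scripted_agent)
print(findings)
```

The single finding shows that states are stored as codes such as `CA`, which is the kind of evidence the final SQL generation step would need in order to filter on California correctly.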
Key Designs
- Adaptive Database Probing:
  - Function: To discover value formats and semantic information in the database on demand.
  - Mechanism: The Agent decides in a loop whether more probing is required. If so, it generates a SELECT query with a LIMIT clause to retrieve relevant record samples. After execution, it summarizes the findings (e.g., "California is stored as CA", "late means ship_date > required_date") and accumulates them into context \(G\).
  - Design Motivation: Unlike static context enhancement (similarity-based retrieval), probing is question-adaptive: different questions require exploring different aspects of the database.
- Verify & Repair:
  - Function: To ensure the SQL satisfies the latent semantic constraints in the question.
  - Mechanism: 10 types of constraints are extracted from the question via pattern matching (DISTINCT→"unique"/"distinct", TOP-K→"top/first N", COUNT→"how many", etc.) to generate a deterministic checklist. The SQL goes through a pipeline of Syntax Check → Execution Check → Constraint Check, where each violation generates a descriptive error message to guide the repair.
  - Design Motivation: Rule verification is reliable (deterministic pattern matching), lightweight (no LLM calls required), and interpretable (each constraint has a clear source). It prioritizes precision over recall: missing a constraint is acceptable, but false positives would introduce unnecessary repairs.
- Complementarity of Probe + Verify:
  - Function: To address understanding errors (\(\epsilon_D + \epsilon_Q\)) and synthesis errors (\(\epsilon_S\)), respectively.
  - Mechanism: Probe uses real database evidence to target database misunderstandings (41.3% of failures) and part of the question misunderstandings (24.8% of failures); Verify uses the constraint checklist to target synthesis errors (33.9% of failures).
  - Design Motivation: The design is driven by error analysis: each component targets a major failure mode.
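As a rough illustration of the constraint checklist, the sketch below covers only the three constraint types named above (DISTINCT, TOP-K, COUNT). The trigger words come from this summary, while the regular expressions, the function names `extract_checklist` and `check_sql`, and the error messages are assumptions made for the example. The syntax and execution checks that precede the constraint check are omitted.

```python
import re
from typing import List, Optional, Tuple

# Illustrative rules only: the trigger words follow the summary,
# but the exact regexes are assumptions, not the paper's patterns.
RULES = {
    "DISTINCT": re.compile(r"\b(?:unique|distinct)\b", re.I),
    "TOP-K": re.compile(r"\b(?:top|first)\s+(\d+)\b", re.I),
    "COUNT": re.compile(r"\bhow many\b", re.I),
}

Constraint = Tuple[str, Optional[int]]  # (constraint type, K for TOP-K else None)


def extract_checklist(question: str) -> List[Constraint]:
    """Build a deterministic checklist from trigger phrases in the question."""
    checklist: List[Constraint] = []
    for name, pattern in RULES.items():
        match = pattern.search(question)
        if match:
            k = int(match.group(1)) if name == "TOP-K" else None
            checklist.append((name, k))
    return checklist


def check_sql(sql: str, checklist: List[Constraint]) -> List[str]:
    """Return one descriptive error message per violated constraint."""
    errors: List[str] = []
    for name, k in checklist:
        if name == "DISTINCT" and not re.search(r"\bDISTINCT\b", sql, re.I):
            errors.append("Question asks for unique values, but the SQL has no DISTINCT.")
        elif name == "TOP-K" and not re.search(rf"\bLIMIT\s+{k}\b", sql, re.I):
            errors.append(f"Question asks for the top {k} rows, but the SQL has no LIMIT {k}.")
        elif name == "COUNT" and not re.search(r"\bCOUNT\s*\(", sql, re.I):
            errors.append("Question asks 'how many', but the SQL has no COUNT(...).")
    return errors


question = "List the top 3 distinct cities by order volume."
checklist = extract_checklist(question)

bad_sql = "SELECT city FROM orders ORDER BY volume DESC"
good_sql = "SELECT DISTINCT city FROM orders ORDER BY volume DESC LIMIT 3"

for message in check_sql(bad_sql, checklist):
    print(message)  # these messages would be passed back to guide the repair
```

Because the checks are purely textual, a query that achieves uniqueness another way (for example with GROUP BY) would be flagged here; a practical rule set would need to account for such equivalents to keep precision high.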
Loss & Training
PV-SQL is training-free. It supports six base LLMs (GPT-4o/4.1, Claude 3.5/3.7, Gemini 2.0/2.5).
Key Experimental Results
Main Results
Execution Accuracy and Valid Efficiency Score on the BIRD Benchmark
| Method | Execution Accuracy (%) | Valid Efficiency Score |
|---|---|---|
| Best Baseline (TS-SQL) | ~60 | ~66 |
| PV-SQL | 65.12 | 86.9 |
Ablation Study
| Configuration | Execution Accuracy (%) | Description |
|---|---|---|
| PV-SQL | 65.12 | Full |
| w/o Probe | 60.8 | Removed Probing, -4.3pp |
| w/o Verify | 62.1 | Removed Verification, -3.0pp |
| w/o Both | 57.3 | Reverted to Baseline |
Key Findings
- Probe and Verify contribute a gain of approximately 4.3pp and 3.0pp respectively, and their combined effect is close to the sum of their individual contributions.
- The token consumption of PV-SQL is lower than that of TS-SQL—rule verification is more efficient than LLM-generated test cases.
- Probe provides the largest gain on "hard" difficulty questions—these questions require the most value grounding.
- The precision of constraint extraction exceeds 90%, which supports the precision-first strategy.
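The near-additivity claim in the first finding can be checked directly from the ablation table. The snippet below is simple arithmetic on the reported numbers; the variable names are illustrative.

```python
# Execution Accuracy (%) values taken from the ablation table.
full, no_probe, no_verify, no_both = 65.12, 60.8, 62.1, 57.3

probe_gain = round(full - no_probe, 2)    # drop when Probe is removed
verify_gain = round(full - no_verify, 2)  # drop when Verify is removed
combined_gain = round(full - no_both, 2)  # drop when both are removed

sum_of_parts = round(probe_gain + verify_gain, 2)
interaction = round(combined_gain - sum_of_parts, 2)

print(f"Probe: {probe_gain}pp, Verify: {verify_gain}pp")
print(f"Combined: {combined_gain}pp vs. sum of parts: {sum_of_parts}pp")
print(f"Difference: {interaction}pp")
```

The combined gain (7.82pp) exceeds the sum of the individual drops (7.34pp) by about half a point, which is consistent with the two components addressing largely separate failure modes.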
Highlights & Insights
- "Looking at the data before writing the query" is a very practical and intuitive strategy—it simulates the workflow of a human data analyst.
- Using rule verification as "test cases" for SQL is a clever analogy—SQL naturally lacks test cases, but the question itself contains verifiable constraints.
- The pattern matching rules for the 10 types of constraints are simple but effective—embodying an engineering philosophy of "simple solutions first".
Limitations & Future Work
- Rule verification only covers 10 types of constraints; complex semantic constraints still rely on LLM understanding.
- A maximum of 5 rounds of probing may not be sufficient for extremely complex databases.
- Validated only on the BIRD series benchmarks.
- Probing queries may expose sensitive data stored in the database.
Related Work & Insights
- vs TS-SQL: TS-SQL verifies with LLM-generated test cases; PV-SQL uses rule-based verification, which is more reliable and lightweight.
- vs DIN-SQL: DIN-SQL decomposes the question but does not probe the database; PV-SQL adds a dimension of database understanding.
- vs MAC-SQL: MAC-SQL is a multi-agent framework but lacks an explicit verification mechanism.
Rating
- Novelty: ⭐⭐⭐⭐ The complementary design of Probe+Verify is novel, and the entry point of rule verification is practical.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 baselines + 6 LLMs + 3 benchmarks + detailed ablations + error analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem-driven, clear motivation, and vivid examples.
- Value: ⭐⭐⭐⭐⭐ Directly advances the practical application of Text-to-SQL.