PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents

Conference: ACL 2026 · arXiv: 2604.17653 · Code: GitHub · Area: Text-to-SQL / Agent
Keywords: Text-to-SQL, database probing, rule-based verification, semantic constraints, agent framework

TL;DR

This paper proposes PV-SQL, an agent-based Text-to-SQL framework with two complementary components: Probe, which iteratively generates probing queries to discover database value formats, column semantics, and table relationships, and Verify, which extracts verifiable constraints via pattern matching and checks generated SQL against the resulting checklist. On the BIRD benchmark, PV-SQL achieves about 5 points higher execution accuracy and a 20.8-point higher valid efficiency score than the best baseline.

Background & Motivation

Background: Text-to-SQL has made significant progress with LLMs, yet persistent challenges remain — schema understanding, value anchoring (mapping natural language to exact database values), and constraint satisfaction (ensuring SQL faithfully captures all semantics).

Limitations of Prior Work: (1) Approximately 41% of failures stem from database misunderstanding — models do not know whether "California" is stored as "CA" or its full name; (2) even when understanding is correct, SQL generation lacks a verification mechanism, potentially producing syntactically valid but semantically incorrect queries; (3) existing verification methods (LLM self-verification / test case generation) are either unreliable or computationally expensive.

Key Challenge: Schema descriptions (DDL) alone do not contain actual data values, yet including all values in the prompt is infeasible. What is needed is on-demand, question-driven exploration of database contents.
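
For intuition, a single cheap probing query can settle a value-format question that the DDL cannot. A minimal sketch, assuming a hypothetical SQLite database with a customers.state column (not from the paper):

```python
import sqlite3

# Hypothetical schema: the DDL says `customers.state TEXT`, but not whether
# values look like "CA" or "California". One LIMITed probe answers it.
conn = sqlite3.connect("shop.db")  # assumed database file
rows = conn.execute("SELECT DISTINCT state FROM customers LIMIT 5").fetchall()
print(rows)  # e.g. [('CA',), ('NY',), ...] -> states are stored abbreviated
```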

Goal: Resolve comprehension errors through adaptive database probing, and resolve synthesis errors through deterministic rule-based verification.

Key Insight: Probe enhances the input (enriching context with real database evidence), while Verify enhances the output (ensuring semantic constraints are satisfied) — the two components address complementary failure types.

Core Idea: The SQL agent first "examines what the data looks like" before writing a query — mirroring the workflow of a human data analyst — and then reviews constraints one by one like a code reviewer.

Method

Overall Architecture

The framework operates in two stages: (1) Probe stage — the agent iteratively generates temporary SQL queries to explore the database (up to 5 rounds), discovering value formats and column semantics and accumulating findings into context \(G\); (2) Verify & Repair stage — 10 categories of verifiable constraints (e.g., DISTINCT / TOP-K / COUNT) are extracted from the question via pattern matching, a checklist is constructed, and after SQL generation the checklist is checked; unsatisfied constraints trigger iterative repair (up to 5 rounds).
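
A minimal sketch of this two-stage control flow, with both loops capped at 5 rounds as in the paper. The llm.* calls are placeholders for the agent's prompts, not the authors' API; extract_constraints and the final constraint-checking stage are sketched under Key Designs below:

```python
MAX_PROBE_ROUNDS = 5
MAX_REPAIR_ROUNDS = 5

def pv_sql(question, schema, db, llm):
    """Two-stage PV-SQL loop. `llm` is any object exposing the four
    hypothetical calls used below; `db.execute` runs SQL and returns rows."""
    # Stage 1: Probe -- explore the database, accumulating findings into G.
    G = []
    for _ in range(MAX_PROBE_ROUNDS):
        probe_sql = llm.propose_probe(question, schema, G)
        if probe_sql is None:               # agent decides no more probing is needed
            break
        sample = db.execute(probe_sql)      # e.g. "SELECT ... LIMIT 5"
        G.append(llm.summarize_finding(probe_sql, sample))

    # Stage 2: Verify & Repair -- deterministic checklist, then repair loop.
    checklist = extract_constraints(question)    # pattern matching, no LLM calls
    sql = llm.generate_sql(question, schema, G)
    for _ in range(MAX_REPAIR_ROUNDS):
        errors = run_checks(sql, checklist, db)  # syntax -> execution -> constraints
        if not errors:
            break
        sql = llm.repair_sql(sql, errors, G)
    return sql
```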

Key Designs

  1. Database Probing:

    • Function: On-demand discovery of value formats and semantic information within the database.
    • Mechanism: The agent decides in a loop whether additional probing is needed. If so, it generates SELECT queries with LIMIT clauses to retrieve relevant record samples. After execution, findings (e.g., "California is stored as CA"; "'late' means ship_date > required_date") are summarized and accumulated into context \(G\).
    • Design Motivation: Unlike static context augmentation (similarity-based retrieval), probing is question-adaptive — different questions require exploration of different database aspects.
  2. Verify & Repair:

    • Function: Ensure that the generated SQL satisfies semantic constraints implicit in the question.
    • Mechanism: Pattern matching extracts 10 constraint categories from the question (DISTINCT → "unique"/"distinct"; TOP-K → "top/first N"; COUNT → "how many"; etc.), forming a deterministic checklist; see the pattern-matching sketch after this list. The SQL then passes through a pipeline of syntax checking → execution checking → constraint checking; each violation generates a descriptive error message to guide repair.
    • Design Motivation: Rule-based verification is reliable (deterministic pattern matching), lightweight (no LLM calls required), and interpretable (each constraint has a clear source). Precision is prioritized over recall — missing a constraint is acceptable, while false positives introduce unnecessary repairs.
  3. Complementarity of Probe and Verify:

    • Function: Address comprehension errors (\(\varepsilon_D + \varepsilon_Q\)) and synthesis errors (\(\varepsilon_S\)) respectively.
    • Mechanism: Probe targets the 41.3% of failures caused by database misunderstanding (and part of the 24.8% caused by question misunderstanding) by grounding generation in real database evidence; Verify targets the 33.9% of failures classed as synthesis errors via the constraint checklist.
    • Design Motivation: Error-analysis-driven design — each component targets a distinct major failure mode.
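
A minimal sketch of the deterministic checklist for three of the ten categories. It supplies the extract_constraints function and the final constraint-checking stage assumed in the earlier control-flow sketch; the patterns are illustrative, not the paper's exact rules:

```python
import re

# Illustrative patterns for 3 of the 10 constraint categories.
CONSTRAINT_PATTERNS = {
    "DISTINCT": re.compile(r"\b(unique|distinct|different)\b", re.I),
    "TOP_K":    re.compile(r"\b(top|first|highest|lowest)\s+\d+\b", re.I),
    "COUNT":    re.compile(r"\bhow many\b", re.I),
}

def extract_constraints(question: str) -> list[str]:
    """Build a deterministic checklist from the question (no LLM calls)."""
    return [name for name, pat in CONSTRAINT_PATTERNS.items() if pat.search(question)]

def check_constraints(sql: str, checklist: list[str]) -> list[str]:
    """Return one descriptive error message per unsatisfied constraint."""
    s, errors = sql.upper(), []
    if "DISTINCT" in checklist and "DISTINCT" not in s:
        errors.append("Question asks for unique values, but the SQL has no DISTINCT.")
    if "TOP_K" in checklist and "LIMIT" not in s:
        errors.append("Question asks for a top-N result, but the SQL has no LIMIT.")
    if "COUNT" in checklist and "COUNT(" not in s:
        errors.append("Question asks 'how many', but the SQL has no COUNT aggregate.")
    return errors

# Example: extract_constraints("How many distinct customers ordered in 2020?")
# -> ["DISTINCT", "COUNT"]
```

Because every rule is a literal pattern, each extracted constraint is traceable to a phrase in the question, which is what makes the checklist interpretable and precision-first.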

Loss & Training

The framework is training-free; it is evaluated with 6 base LLMs (GPT-4o/4.1, Claude 3.5/3.7, Gemini 2.0/2.5).

Key Experimental Results

Main Results

BIRD Benchmark Execution Accuracy

Method                 | Execution Accuracy (%) | Valid Efficiency Score
Best Baseline (TS-SQL) | ~60                    | ~66
PV-SQL                 | 65.12                  | 86.9

Ablation Study

Configuration | Execution Accuracy (%) | Notes
PV-SQL        | 65.12                  | Full model
w/o Probe     | 60.8                   | Remove probing (−4.3 pp)
w/o Verify    | 62.1                   | Remove verification (−3.0 pp)
w/o Both      | 57.3                   | Reverts to baseline

Key Findings

  • Probe and Verify contribute approximately 4.3 pp and 3.0 pp of improvement respectively, and their combined effect is close to the sum of the individual gains.
  • PV-SQL consumes fewer tokens than TS-SQL — rule-based verification is more efficient than LLM-generated test cases.
  • Probe yields the largest gains on "hard" difficulty questions, which most require value anchoring.
  • Constraint extraction achieves precision > 90%, confirming the correctness of the precision-first strategy.

Highlights & Insights

  • "Examine the data before writing the query" is a highly practical and intuitive strategy that emulates the workflow of a human data analyst.
  • Rule-based verification as a form of "test cases" for SQL is a clever analogy — SQL naturally lacks test cases, yet the question itself encodes verifiable constraints.
  • The pattern-matching rules for 10 constraint categories are simple yet effective, reflecting an engineering philosophy of preferring straightforward solutions.

Limitations & Future Work

  • Rule-based verification covers only 10 constraint categories; complex semantic constraints still rely on LLM understanding.
  • A maximum of 5 probing rounds may be insufficient for very complex databases.
  • Evaluation is conducted exclusively on the BIRD benchmark family.
  • Probing queries may expose sensitive data.

Comparison with Prior Methods

  • vs. TS-SQL: TS-SQL uses LLMs to generate test cases for verification; PV-SQL's rule-based verification is more reliable and lightweight.
  • vs. DIN-SQL: DIN-SQL decomposes questions but does not probe the database; PV-SQL adds a database understanding dimension.
  • vs. MAC-SQL: MAC-SQL employs a multi-agent framework but lacks an explicit verification mechanism.

Rating

  • Novelty: ⭐⭐⭐⭐ The complementary Probe+Verify design is novel; the rule-based verification angle is practically motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 baselines + 6 LLMs + 3 benchmarks + detailed ablation + error analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Problem-driven narrative with clear motivation and vivid examples.
  • Value: ⭐⭐⭐⭐⭐ Directly advances the practical deployment of Text-to-SQL systems.