PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents

Conference: ACL 2026 · arXiv: 2604.17653 · Code: GitHub · Area: Text-to-SQL / Agent
Keywords: Text-to-SQL, database probing, rule-based verification, semantic constraints, agent framework

TL;DR

This paper proposes PV-SQL, an agent-based Text-to-SQL framework with two complementary components: Probe, which iteratively generates probing queries to discover database value formats, column semantics, and table relationships, and Verify, which extracts verifiable constraints via pattern matching and checks generated SQL against the resulting checklist. On the BIRD benchmark, PV-SQL achieves about 5 points higher execution accuracy and a 20.8-point higher valid efficiency score than the best baseline.

Background & Motivation

Background: Text-to-SQL has made significant progress with LLMs, yet persistent challenges remain — schema understanding, value anchoring (mapping natural language to exact database values), and constraint satisfaction (ensuring SQL faithfully captures all semantics).

Limitations of Prior Work: (1) Approximately 41% of failures stem from database misunderstanding — models do not know whether "California" is stored as "CA" or its full name; (2) even when understanding is correct, SQL generation lacks a verification mechanism, potentially producing syntactically valid but semantically incorrect queries; (3) existing verification methods (LLM self-verification / test case generation) are either unreliable or computationally expensive.

Key Challenge: Schema descriptions (DDL) alone do not contain actual data values, yet including all values in the prompt is infeasible. What is needed is on-demand, question-driven exploration of database contents.
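
For intuition, a single cheap probing query can settle a value-format question that the DDL cannot. A minimal sketch, assuming a hypothetical SQLite database with a customers.state column (not from the paper):

```python
import sqlite3

# Hypothetical schema: the DDL says `customers.state TEXT`, but not whether
# values look like "CA" or "California". One LIMITed probe answers it.
conn = sqlite3.connect("shop.db")  # assumed database file
rows = conn.execute("SELECT DISTINCT state FROM customers LIMIT 5").fetchall()
print(rows)  # e.g. [('CA',), ('NY',), ...] -> states are stored abbreviated
```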

Goal: Resolve comprehension errors through adaptive database probing, and resolve synthesis errors through deterministic rule-based verification.

Key Insight: Probe enhances the input (enriching context with real database evidence), while Verify enhances the output (ensuring semantic constraints are satisfied) — the two components address complementary failure types.

Core Idea: The SQL agent first "examines what the data looks like" before writing a query — mirroring the workflow of a human data analyst — and then reviews constraints one by one like a code reviewer.

Method

Overall Architecture

The framework operates in two stages: (1) Probe stage — the agent iteratively generates temporary SQL queries to explore the database (up to 5 rounds), discovering value formats and column semantics and accumulating findings into context \(G\); (2) Verify & Repair stage — 10 categories of verifiable constraints (e.g., DISTINCT / TOP-K / COUNT) are extracted from the question via pattern matching, a checklist is constructed, and after SQL generation the checklist is checked; unsatisfied constraints trigger iterative repair (up to 5 rounds).
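
A minimal sketch of this two-stage control flow, with both loops capped at 5 rounds as in the paper. The llm.* calls are placeholders for the agent's prompts, not the authors' API; extract_constraints and the final constraint-checking stage are sketched under Key Designs below:

```python
MAX_PROBE_ROUNDS = 5
MAX_REPAIR_ROUNDS = 5

def pv_sql(question, schema, db, llm):
    """Two-stage PV-SQL loop. `llm` is any object exposing the four
    hypothetical calls used below; `db.execute` runs SQL and returns rows."""
    # Stage 1: Probe -- explore the database, accumulating findings into G.
    G = []
    for _ in range(MAX_PROBE_ROUNDS):
        probe_sql = llm.propose_probe(question, schema, G)
        if probe_sql is None:               # agent decides no more probing is needed
            break
        sample = db.execute(probe_sql)      # e.g. "SELECT ... LIMIT 5"
        G.append(llm.summarize_finding(probe_sql, sample))

    # Stage 2: Verify & Repair -- deterministic checklist, then repair loop.
    checklist = extract_constraints(question)    # pattern matching, no LLM calls
    sql = llm.generate_sql(question, schema, G)
    for _ in range(MAX_REPAIR_ROUNDS):
        errors = run_checks(sql, checklist, db)  # syntax -> execution -> constraints
        if not errors:
            break
        sql = llm.repair_sql(sql, errors, G)
    return sql
```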

Key Designs

  1. Database Probing:

    • Function: On-demand discovery of value formats and semantic information within the database.
    • Mechanism: The agent decides in a loop whether additional probing is needed. If so, it generates SELECT queries with LIMIT clauses to retrieve relevant record samples. After execution, findings (e.g., "California is stored as CA"; "'late' means ship_date > required_date") are summarized and accumulated into context \(G\).
    • Design Motivation: Unlike static context augmentation (similarity-based retrieval), probing is question-adaptive — different questions require exploration of different database aspects.
  2. Verify & Repair:

    • Function: Ensure that the generated SQL satisfies semantic constraints implicit in the question.
    • Mechanism: Pattern matching extracts 10 constraint categories from the question (DISTINCT → "unique"/"distinct"; TOP-K → "top/first N"; COUNT → "how many"; etc.), forming a deterministic checklist; see the pattern-matching sketch after this list. The SQL then passes through a pipeline of syntax checking → execution checking → constraint checking; each violation generates a descriptive error message to guide repair.
    • Design Motivation: Rule-based verification is reliable (deterministic pattern matching), lightweight (no LLM calls required), and interpretable (each constraint has a clear source). Precision is prioritized over recall — missing a constraint is acceptable, while false positives introduce unnecessary repairs.
  3. Complementarity of Probe and Verify:

    • Function: Address comprehension errors (\(\varepsilon_D + \varepsilon_Q\)) and synthesis errors (\(\varepsilon_S\)) respectively.
    • Mechanism: Probe targets the 41.3% of failures caused by database misunderstanding (and part of the 24.8% caused by question misunderstanding) by grounding generation in real database evidence; Verify targets the 33.9% of failures classed as synthesis errors via the constraint checklist.
    • Design Motivation: Error-analysis-driven design — each component targets a distinct major failure mode.
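
A minimal sketch of the deterministic checklist for three of the ten categories. It supplies the extract_constraints function and the final constraint-checking stage assumed in the earlier control-flow sketch; the patterns are illustrative, not the paper's exact rules:

```python
import re

# Illustrative patterns for 3 of the 10 constraint categories.
CONSTRAINT_PATTERNS = {
    "DISTINCT": re.compile(r"\b(unique|distinct|different)\b", re.I),
    "TOP_K":    re.compile(r"\b(top|first|highest|lowest)\s+\d+\b", re.I),
    "COUNT":    re.compile(r"\bhow many\b", re.I),
}

def extract_constraints(question: str) -> list[str]:
    """Build a deterministic checklist from the question (no LLM calls)."""
    return [name for name, pat in CONSTRAINT_PATTERNS.items() if pat.search(question)]

def check_constraints(sql: str, checklist: list[str]) -> list[str]:
    """Return one descriptive error message per unsatisfied constraint."""
    s, errors = sql.upper(), []
    if "DISTINCT" in checklist and "DISTINCT" not in s:
        errors.append("Question asks for unique values, but the SQL has no DISTINCT.")
    if "TOP_K" in checklist and "LIMIT" not in s:
        errors.append("Question asks for a top-N result, but the SQL has no LIMIT.")
    if "COUNT" in checklist and "COUNT(" not in s:
        errors.append("Question asks 'how many', but the SQL has no COUNT aggregate.")
    return errors

# Example: extract_constraints("How many distinct customers ordered in 2020?")
# -> ["DISTINCT", "COUNT"]
```

Because every rule is a literal pattern, each extracted constraint is traceable to a phrase in the question, which is what makes the checklist interpretable and precision-first.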

Loss & Training

The framework is training-free; it is evaluated with 6 base LLMs (GPT-4o/4.1, Claude 3.5/3.7, Gemini 2.0/2.5).

Key Experimental Results

Main Results

BIRD Benchmark Execution Accuracy

Method                 | Execution Accuracy (%) | Valid Efficiency Score
Best Baseline (TS-SQL) | ~60                    | ~66
PV-SQL                 | 65.12                  | 86.9

Ablation Study

Configuration | Execution Accuracy (%) | Notes
PV-SQL        | 65.12                  | Full model
w/o Probe     | 60.8                   | Remove probing (−4.3 pp)
w/o Verify    | 62.1                   | Remove verification (−3.0 pp)
w/o Both      | 57.3                   | Reverts to baseline

Key Findings

  • Probe and Verify contribute approximately 4.3 pp and 3.0 pp of improvement respectively, and their combined effect is close to the sum of the individual gains.
  • PV-SQL consumes fewer tokens than TS-SQL — rule-based verification is more efficient than LLM-generated test cases.
  • Probe yields the largest gains on "hard" difficulty questions, which most require value anchoring.
  • Constraint extraction achieves precision > 90%, confirming the correctness of the precision-first strategy.

Highlights & Insights

  • "Examine the data before writing the query" is a highly practical and intuitive strategy that emulates the workflow of a human data analyst.
  • Rule-based verification as a form of "test cases" for SQL is a clever analogy — SQL naturally lacks test cases, yet the question itself encodes verifiable constraints.
  • The pattern-matching rules for 10 constraint categories are simple yet effective, reflecting an engineering philosophy of preferring straightforward solutions.

Limitations & Future Work

  • Rule-based verification covers only 10 constraint categories; complex semantic constraints still rely on LLM understanding.
  • A maximum of 5 probing rounds may be insufficient for very complex databases.
  • Evaluation is conducted exclusively on the BIRD benchmark family.
  • Probing queries may expose sensitive data.

Comparison with Prior Methods

  • vs. TS-SQL: TS-SQL uses LLMs to generate test cases for verification; PV-SQL's rule-based verification is more reliable and lightweight.
  • vs. DIN-SQL: DIN-SQL decomposes questions but does not probe the database; PV-SQL adds a database understanding dimension.
  • vs. MAC-SQL: MAC-SQL employs a multi-agent framework but lacks an explicit verification mechanism.

Rating

  • Novelty: ⭐⭐⭐⭐ The complementary Probe+Verify design is novel; the rule-based verification angle is practically motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 baselines + 6 LLMs + 3 benchmarks + detailed ablation + error analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Problem-driven narrative with clear motivation and vivid examples.
  • Value: ⭐⭐⭐⭐⭐ Directly advances the practical deployment of Text-to-SQL systems.