RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

Conference: ACL 2026
arXiv: 2605.00199
Code: https://github.com/JugalGajjar/RSAT
Area: LLM Reasoning / Table QA / Interpretable Attribution
Keywords: Table Reasoning, Cell-level Citation, GRPO, Faithfulness, Small Language Models

TL;DR

RSAT trains 1B-8B Small Language Models (SLMs) using "SFT with structured citation format + GRPO with NLI faithfulness as the core reward." This enables table QA to not only provide answers but also bind each reasoning step to specific table cells, improving average faithfulness from 0.224 in SFT to 0.826.

Background & Motivation

Background: Table question answering and fact verification have explored various paths such as TAPAS, TAPEX, TaBERT, Binder, Chain-of-Table, and TaPERA. The primary objective has been improving answer accuracy or enabling models to perform table operations. Recent LLM-based methods generate chain-of-thought sequences, but these typically yield natural language reasoning processes without explicit grounding.

Limitations of Prior Work: When presented with an answer, users find it difficult to discern which specific cells the model relied upon. Table reasoning is frequently applied in audit-required scenarios like finance, news, and medicine, where "correct answers" alone are insufficient. If reasoning steps cannot be mapped to evidence cells, it remains impossible to determine whether the model reasoned correctly, succeeded by chance, or fabricated explanations post-hoc.

Key Challenge: While structured formats are easily imitated through supervised learning, ensuring that "cited cells truly support the reasoning step" is a matter of semantic faithfulness rather than formatting. Experimental results in the paper show that while SFT achieves a format success rate of nearly 99%, the average faithfulness remains around 22%.

Goal: The authors aim to enable small models to generate auditable cell-level citations directly during reasoning generation, rather than appending citations after the answer is produced. Specifically, the model must simultaneously satisfy answer quality, JSON formatting, coordinate validity, evidence faithfulness, and citation conciseness.

Key Insight: RSAT treats "citation faithfulness" as a training objective. The authors first use a small set of verified structured traces to teach the model output formats, then utilize GRPO to optimize a composite reward on large-scale table QA data without traces, where NLI entailment directly measures whether cited cells support the reasoning step.

Core Idea: Attribution in table reasoning is transformed from a post-processing task into a reinforcement learning objective during generation. A computable faithfulness reward compels the small model to ground every reasoning step in actual cell evidence.

Method

RSAT works in two moves: first constrain the output space to a verifiable structure, then use rewards to make that structure trustworthy. Rather than redesigning table encoders, it uses LoRA to fine-tune existing instruction models into "small reasoners capable of citing table evidence."

Overall Architecture

The input consists of a serialized table and a natural language question. The table is flattened into text with markers like [HEADER] and [ROW 0], allowing the model to map [row, col] coordinates in the output back to the original cells. The output is a JSON object containing a set of reasoning_steps, where each step includes a natural language claim and a list of cited_cells coordinates, followed by a final_answer.
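The input and output formats described above can be sketched as follows. This is a minimal illustration, not the authors' serializer: the `" | "` cell delimiter, the step key `claim`, and the helper names `serialize_table` and `resolve_citations` are assumptions, while `[HEADER]`, `[ROW i]`, `reasoning_steps`, `cited_cells`, and `final_answer` come from the description.

```python
import json

def serialize_table(header, rows):
    """Flatten a table into text with [HEADER] and [ROW i] markers.

    The " | " delimiter is an assumption; the source only names the markers.
    """
    lines = ["[HEADER] " + " | ".join(header)]
    for i, row in enumerate(rows):
        lines.append(f"[ROW {i}] " + " | ".join(str(v) for v in row))
    return "\n".join(lines)

def resolve_citations(output_json, rows):
    """Map each step's [row, col] pairs back to cell values.

    An out-of-range coordinate resolves to None so it can be flagged as invalid.
    """
    parsed = json.loads(output_json)
    resolved = []
    for step in parsed["reasoning_steps"]:
        cells = []
        for r, c in step["cited_cells"]:
            in_range = 0 <= r < len(rows) and 0 <= c < len(rows[r])
            cells.append(rows[r][c] if in_range else None)
        resolved.append((step["claim"], cells))
    return resolved, parsed["final_answer"]

# Hypothetical toy table and model output, for illustration only.
header = ["Year", "Winner", "Score"]
rows = [["2019", "Ajax", "3"], ["2020", "PSV", "1"]]
example_output = json.dumps({
    "reasoning_steps": [
        {"claim": "The 2020 winner is PSV.", "cited_cells": [[1, 0], [1, 1]]},
        {"claim": "This step cites a missing row.", "cited_cells": [[5, 0]]},
    ],
    "final_answer": "PSV",
})

serialized = serialize_table(header, rows)
resolved, answer = resolve_citations(example_output, rows)
```

Because row indices appear verbatim in the serialized text, a coordinate such as `[1, 1]` can be checked mechanically against the original table, which is what makes the citations auditable by a program as well as by a reader.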

Training is split into two stages. The first stage is SFT, using 1,000 structured reasoning traces generated by Claude Opus 4.5 and verified by scripts to teach the model the required format. The second stage is GRPO, where the model samples 8 candidate outputs for the same question, scores them using a composite reward, and updates the policy based on relative advantages within the group. Evaluation covers Qwen 2.5 Instruct (1.5B/3B/7B) and Llama 3 Instruct (1B/3B/8B) using datasets from WTQ, FeTaQA, and TabFact.
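The "relative advantages within the group" step can be sketched as below. This follows the commonly used GRPO formulation, in which each candidate's reward is standardized against the mean and standard deviation of its own group; whether RSAT uses exactly this normalization is an assumption, and the function name, `eps`, and the reward values are illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled candidates.

    Candidates that beat the group mean get a positive advantage, the rest
    a negative one; no learned critic is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical composite rewards for G = 8 candidates of a single question.
group_rewards = [0.2, 1.5, 0.9, 1.5, -1.0, 0.4, 1.1, 0.7]
advantages = group_relative_advantages(group_rewards)

# A group where every candidate scores the same carries no learning signal.
flat = group_relative_advantages([0.5] * 8)
```

The baseline here is simply the group mean, which is why no separate value network is needed.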

Key Designs

  1. Structured Cell-level Reasoning Output:

    • Function: Consolidates answers, reasoning steps, and evidence coordinates into a parsable JSON, allowing each claim to be traced back to specific cells by humans or programs.
    • Mechanism: Table serialization explicitly preserves row numbers and column positions, and each reasoning step in the output must provide [row, col] coordinates. Citations therefore target the smallest auditable unit in the table structure, the individual cell, rather than a paragraph or passage.
    • Design Motivation: Errors in table reasoning often stem from selecting the wrong row or column or misinterpreting relationships between cells. Cell-level citations expose these errors and make "correct answer but incorrect evidence" visible.
  2. SFT for Format, GRPO for Quality:

    • Function: Decouples "output capability" from "output credibility" to avoid over-reliance on a small number of human or teacher traces.
    • Mechanism: SFT traces only need to pass JSON, coordinate boundary, and step count checks to stabilize format learning. The GRPO stage requires no traces, using only the gold answer and reward functions to evaluate model-generated candidates.
    • Design Motivation: The authors found that after SFT, format success and citation validity reached nearly 99%, but faithfulness was only 0.224, indicating that surface-level imitation of traces is insufficient for true grounding.
  3. Composite Reward Centered on NLI Faithfulness:

    • Function: Simultaneously optimizes for answer accuracy, citation validity, faithfulness, conciseness, and format constraints.
    • Mechanism: The reward is defined as \(R=R_{ans}+0.3R_{cite}+0.5R_{faith}+0.2R_{pars}+R_{fmt}\). Specifically, \(R_{faith}\) concatenates the values of cited cells into an evidence string and uses DeBERTa-v3-base NLI to determine if it entails the reasoning text; \(R_{pars}\) penalizes excessive citations per step; and JSON parsing failures trigger a hard format penalty.
    • Design Motivation: Rewarding only the answer leads the model to ignore citations, while rewarding only valid coordinates leads to random citations of real cells. NLI faithfulness transforms "whether citations support the claim" into an optimizable signal, representing the core design of the paper.
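A minimal sketch of how such a composite reward might be computed is shown below. Only the weights 0.3, 0.5, and 0.2 and the overall structure come from the formula above. Everything else is an assumption made for illustration: token-level F1 as \(R_{ans}\), the fraction of in-range coordinates as \(R_{cite}\), the mean per-step entailment score as \(R_{faith}\), a budget of 3 cells per step for \(R_{pars}\), a penalty of -1 for unparsable output, the `" ; "` evidence separator, and the step key `claim`. The NLI model is passed in as a callable `entail_prob(premise, hypothesis)`, and `toy_entail` is a string-matching stand-in, not DeBERTa-v3-base.

```python
import json
from collections import Counter

W_CITE, W_FAITH, W_PARS = 0.3, 0.5, 0.2  # weights from the reward formula
FORMAT_PENALTY = -1.0  # assumed value of the hard format penalty
MAX_CELLS = 3          # assumed per-step citation budget

def token_f1(pred, gold):
    """Token-overlap F1, an assumed stand-in for R_ans."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def composite_reward(output_text, rows, gold, entail_prob):
    try:
        out = json.loads(output_text)
        steps = [(str(s["claim"]), [(int(r), int(c)) for r, c in s["cited_cells"]])
                 for s in out["reasoning_steps"]]
        answer = str(out["final_answer"])
    except (ValueError, KeyError, TypeError):
        return FORMAT_PENALTY  # unparsable or malformed output

    valid = total = 0
    faith, pars = [], []
    for claim, cited in steps:
        ok = [(r, c) for r, c in cited
              if 0 <= r < len(rows) and 0 <= c < len(rows[r])]
        total += len(cited)
        valid += len(ok)
        # Evidence string: values of the validly cited cells, concatenated.
        evidence = " ; ".join(str(rows[r][c]) for r, c in ok)
        faith.append(entail_prob(evidence, claim) if ok else 0.0)
        pars.append(1.0 if len(cited) <= MAX_CELLS else MAX_CELLS / len(cited))

    r_cite = valid / total if total else 0.0
    r_faith = sum(faith) / len(faith) if faith else 0.0
    r_pars = sum(pars) / len(pars) if pars else 0.0
    return token_f1(answer, gold) + W_CITE * r_cite + W_FAITH * r_faith + W_PARS * r_pars

def toy_entail(premise, hypothesis):
    """Toy stand-in for an NLI model: 1.0 if every cited value appears in the claim."""
    return float(all(v in hypothesis for v in premise.split(" ; ")))

rows = [["2019", "Ajax", "3"], ["2020", "PSV", "1"]]

def make_output(cells):
    return json.dumps({"reasoning_steps": [{"claim": "Ajax won in 2019",
                                            "cited_cells": cells}],
                       "final_answer": "Ajax"})

r_faithful = composite_reward(make_output([[0, 1], [0, 0]]), rows, "Ajax", toy_entail)
r_unfaithful = composite_reward(make_output([[1, 1]]), rows, "Ajax", toy_entail)
r_broken = composite_reward("not json", rows, "Ajax", toy_entail)
```

In this toy case both outputs have the correct answer and valid coordinates, so only the faithfulness term separates them. That mirrors the design motivation: without \(R_{faith}\), citing any real cell would score as well as citing the right one.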

Loss & Training

SFT applies LoRA to the linear layers (Q/K/V/O, gate, up, and down projections) and trains for 3 epochs with a learning rate of approximately \(2\times 10^{-4}\). For GRPO, the SFT adapter is merged into the base weights and a fresh LoRA is trained on 500 samples for 1 epoch. For each question, \(G=8\) candidates are sampled at a temperature of 0.9, and the learning rate is \(5\times 10^{-5}\). The authors emphasize that GRPO is cheaper than PPO because it eliminates the critic, which makes training feasible on a single H100; total training for the six main models and the ablations took approximately 36.8 GPU-hours.
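For reference, the reported hyperparameters can be collected in one place, as in the sketch below. The class and field names are hypothetical, the module names follow the usual Hugging Face naming for Llama and Qwen checkpoints (an assumption about the implementation), and values the summary does not state, such as LoRA rank, are deliberately left out. Reading "500 samples" as 500 questions is also an assumption.

```python
from dataclasses import dataclass

# Assumed Hugging Face-style names for the Q/K/V/O, gate, up, and down projections.
LORA_TARGET_MODULES = (
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
)

@dataclass(frozen=True)
class SFTStage:
    num_traces: int = 1000        # verified structured traces
    epochs: int = 3
    learning_rate: float = 2e-4   # reported as approximate

@dataclass(frozen=True)
class GRPOStage:
    num_questions: int = 500      # "500 samples", read here as 500 questions
    epochs: int = 1
    group_size: int = 8           # G candidates sampled per question
    learning_rate: float = 5e-5
    temperature: float = 0.9

    @property
    def rollouts_per_epoch(self) -> int:
        """Number of sampled completions scored by the reward in one epoch."""
        return self.num_questions * self.group_size

sft = SFTStage()
grpo = GRPOStage()
total_rollouts = grpo.rollouts_per_epoch * grpo.epochs
```

Under this reading, the GRPO stage scores 500 × 8 = 4,000 sampled completions per model.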

Key Experimental Results

Main Results

RSAT outperformed zero-shot, SFT-only, and post-hoc baselines across all six models. A key observation is that SFT nearly solves the formatting issue, but faithfulness remains low; GRPO significantly boosts faithfulness without sacrificing answer F1.

| Model | Method | Answer F1 | Citation Validity | Faithfulness | Conciseness | Format Success |
|---|---|---|---|---|---|---|
| Qwen 1.5B | SFT | 0.371 | 0.995 | 0.149 | 0.918 | 0.998 |
| Qwen 1.5B | RSAT | 0.524 | 0.996 | 0.847 | 0.990 | 0.998 |
| Qwen 3B | SFT | 0.531 | 0.996 | 0.213 | 0.848 | 0.998 |
| Qwen 3B | RSAT | 0.592 | 0.999 | 0.946 | 0.996 | 1.000 |
| Qwen 7B | SFT | 0.576 | 1.000 | 0.234 | 0.888 | 1.000 |
| Qwen 7B | RSAT | 0.619 | 0.992 | 0.977 | 0.992 | 0.992 |
| Llama 8B | SFT | 0.555 | 0.996 | 0.288 | 0.830 | 0.998 |
| Llama 8B | RSAT | 0.647 | 1.000 | 0.972 | 1.000 | 1.000 |

The stage-wise contributions make the division of labor clear. On average, SFT improves format success by +0.61, citation validity by +0.64, and F1 by +0.34 over zero-shot, but raises faithfulness by only +0.19. From SFT to RSAT, formatting remains stable while faithfulness rises by a further +0.60 (0.224 to 0.826) and F1 by +0.09.
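The SFT-to-RSAT gains can be recomputed for the four models whose rows are listed in the main results table. The sketch below uses only those listed values. The averages quoted in the text presumably cover all six models, including Llama 1B and 3B, whose rows are not shown, so the subset means are not expected to match the reported figures exactly.

```python
# (Answer F1, Faithfulness) per model and method, copied from the main results table.
ROWS = {
    "Qwen 1.5B": {"sft": (0.371, 0.149), "rsat": (0.524, 0.847)},
    "Qwen 3B":   {"sft": (0.531, 0.213), "rsat": (0.592, 0.946)},
    "Qwen 7B":   {"sft": (0.576, 0.234), "rsat": (0.619, 0.977)},
    "Llama 8B":  {"sft": (0.555, 0.288), "rsat": (0.647, 0.972)},
}

def mean_gain(index):
    """Mean RSAT-minus-SFT difference for one metric column."""
    gains = [m["rsat"][index] - m["sft"][index] for m in ROWS.values()]
    return sum(gains) / len(gains)

f1_gain = mean_gain(0)
faith_gain = mean_gain(1)
smallest_faith_gain = min(m["rsat"][1] - m["sft"][1] for m in ROWS.values())
```

On this subset the mean gains are about +0.09 in F1 and about +0.71 in faithfulness. The F1 figure matches the reported average, while the faithfulness figure exceeds the reported +0.60, which is consistent with the two smaller Llama models gaining less (Llama 3B reaches 0.735 after RSAT).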

Ablation Study

| Configuration | F1 | Faithfulness | Conciseness | Description |
|---|---|---|---|---|
| Qwen 7B Full | 0.619 | 0.977 | 0.992 | Complete RSAT |
| Qwen 7B w/o Faithfulness Reward | 0.635 | 0.117 | 1.000 | F1 rises slightly, but grounding collapses |
| Qwen 7B w/o Conciseness Reward | 0.612 | 0.952 | 0.604 | Tends to over-cite (5-6 cells) |
| Qwen 7B w/o Citation Validity Reward | 0.605 | 0.934 | 0.993 | Minor impact because SFT already teaches valid coordinates |
| Llama 8B Full | 0.647 | 0.972 | 1.000 | Complete RSAT |
| Llama 8B w/o Faithfulness Reward | 0.638 | 0.031 | 0.996 | Faithfulness drops from near perfect to near zero |

Key Findings

  • Post-hoc attribution is largely unusable for small models, with an average format success rate of only 12.7% (Qwen 3B at 0.4%). This suggests that "free reasoning followed by coordinate filling" places excessive demands on working memory and table look-up capabilities.
  • Qwen significantly outperforms Llama at smaller scales: at ~3B, Qwen reaches 0.946 faithfulness compared to Llama 3B's 0.735; they converge at the 7B/8B scale.
  • Faithfulness reward is the only irreplaceable signal. Without it, the model still generates structured JSON and valid coordinates, but these citations no longer support the reasoning.

Highlights & Insights

  • This paper converts attribution from "explaining generated results" into a "training objective during generation," which is a clean conceptual shift. It serves as a reminder that format constraints and evidence faithfulness are distinct capabilities; format success cannot substitute for grounding quality.
  • Using NLI as a step-level reward is highly practical. Despite potential proxy bias, it creates a computable closed loop between table evidence, reasoning text, and the RL objective that is cheap enough for SLM training.
  • The failure of the post-hoc baseline is a significant negative result. While many interpretability systems assume "citing after answering" is feasible, RSAT demonstrates that in SLMs and structured table scenarios, citations must be intrinsic to the generation process.
  • This paradigm is transferable to other structured evidence scenarios like knowledge graphs, code execution results, or tool-call logs: define an auditable output structure, then optimize with task-specific rewards.

Limitations & Future Work

  • The primary limitation is that both the training reward and the main evaluation metric rely on the same DeBERTa NLI scorer, creating a train-eval circularity. The model might learn to satisfy the scorer's linguistic preferences rather than human judgments of faithfulness.
  • Evaluation is limited to WTQ, FeTaQA, and TabFact, leaving generalization to complex schemas like financial statements or clinical tables unverified.
  • Exact Match (EM) is very low (0.000-0.018), suggesting instability in matching gold strings. The paper relies on F1, which complicates direct comparisons with traditional table QA systems.
  • The conciseness reward may lead to overly short outputs or excessive compression of reasoning steps. Balancing "succinctness" and "explanatory power" requires further study.
  • Human evaluation is a necessary next step, particularly to verify if NLI-deemed faithful citations are acceptable to human auditors.

Comparison with Related Work

  • vs TAPAS / TAPEX / TaBERT: These focus on table understanding and accuracy, whereas RSAT focuses on the step-level evidence behind the answer. RSAT complements rather than replaces table encoders by adding an auditable output layer.
  • vs Chain-of-Table / TaPERA: These emphasize reasoning via table transformations or program decomposition. RSAT emphasizes citing original cells for every step, providing finer granularity.
  • vs Self-RAG / ALCE / RARR: These textual attribution methods typically handle passage-level evidence. RSAT extends attribution to cell-level structured evidence via direct RL training.
  • vs Post-hoc Attribution: Post-hoc methods require a second pass to map text to coordinates. RSAT generates them synchronously in the first pass, making it more suitable for small models.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Combines GRPO and cell-level faithful attribution naturally; the core value lies in the design of the training objective rather than in a new algorithm.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ Solid results across six models and three datasets with post-hoc controls and reward ablations, though lacking human evaluation and cross-domain validation.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear problem definitions, stage-wise contributions, and ablation conclusions; negative results are well-explained.
  • Value: ⭐⭐⭐⭐⭐ Highly insightful for deploying SLMs in auditable table reasoning, especially regarding grounded reasoning design in high-stakes domains.