Theme-Explanation Structure for Table Summarization Using Large Language Models¶

Conference: ACL 2025
arXiv: 2501.10487
Code: None
Area: Table Understanding / Text Generation
Keywords: Table Summarization, Theme-Explanation Structure, CoT Reasoning, Korean Administrative Documents, In-Context Learning

TL;DR¶

The authors propose the Tabular-TX pipeline, which achieves deep table understanding via multi-step CoT reasoning, generates clear sentences using a journalist persona prompt, and structures the output into a Theme (adverbial theme) + Explanation (predicative explanation) format. On a Korean administrative table summarization benchmark, it achieves the best performance with a ROUGE-1 score of 0.51 without relying on fine-tuning, significantly outperforming fine-tuning and pure ICL methods.

Background & Motivation¶

Tables are the primary medium for conveying core information in the administrative domain, where a large amount of critical data is stored in structural table formats. The ability of LLMs to accurately summarize and explain tabular content is essential for data utilization. However, high-quality table summarization faces several core challenges:

Human Readability is Overlooked: Existing table-to-text generation research focuses on model architectures and automatic evaluation metrics, while rarely considering the comprehensibility of the generated text to human readers. Outputs may be factually correct but lack clarity, conciseness, and intuitiveness.

Compositional Deficiency in Table Understanding: Processing table summarization requires LLMs to simultaneously possess diverse capabilities such as table identification, mathematical reasoning, and commonsense inference. Analyzing data points in isolation without sufficiently integrating their relationships leads to biased explanations.

Specific Difficulties in Korean Administrative Tables: The implicit nature of Korean (such as subject omission), the discrepancy between administrative terminology and everyday language, and morphological complexities (e.g., particle ambiguity) impose further challenges.

Resource-Constrained Scenarios: In many practical applications, the annotated data and computational resources required for large-scale fine-tuning are unavailable.

The core idea of this paper is that rather than solely focusing on "making LLMs understand tables," it is equally crucial to consider "making the output user-friendly." By designing a structured output format—the Theme-Explanation (TX) structure—readability is maximized while maintaining accuracy. Specifically:

Theme Part: A noun phrase of the table title combined with a citation/basis expression (e.g., "According to..."), providing key contextual anchors.
Explanation Part: Structured analysis based on highlighted cells (enumeration/magnitude comparison/trend analysis), constituting the core content.

This structure borrows from the "inverted pyramid" writing principle in journalism—presenting the context (Theme) first, followed by the facts (Explanation)—ensuring readers can accurately understand the meaning of figures right from the first sentence.

Method¶

Overall Architecture¶

The Tabular-TX pipeline consists of four stages: 1. Data Preprocessing: Converting tables into key-value dictionary formats, handling merged cells, and retaining highlighted and relevant cells. 2. CoT Multi-step Reasoning: Conducting step-by-step reasoning to ensure deep table understanding by the LLM. 3. Journalist Persona Prompt: Guiding the LLM to generate clear, objective, and well-structured sentences. 4. TX Structured Output: Organizing the generated content into the Theme + Explanation format.

Key Designs¶

Data Preprocessing:
- Key-Value Pair Conversion: Converting tabular data into dictionary format. Since LLMs primarily process sequential text, directly processing raw table formats can lead to misunderstandings of hierarchical relationships.
- Merged Cell Handling: Merged cells spanning multiple rows or columns are replicated across all covered positions, ensuring the LLM correctly recognizes the dependencies between cells. For example, if a "2020" label spans columns 3 and 4, it must appear in both columns.
- Relevant Cell Filtering: Retaining only highlighted cells and their corresponding row/column header cells to reduce data complexity and focus on the summarization target.
Chain-of-Thought (CoT) Multi-step Reasoning:
- Step 1 - Cell Type Classification: Distinguishing between monetary values, percentages, categorical data, and textual notes to prevent bugs such as misinterpreting percentages as ordinary numbers.
- Step 2 - Analysis Method Selection: Choosing appropriate analysis methods according to data types:
  - Enumeration: Listing items individually.
  - Magnitude comparison: Ranking numerical values.
  - Trend analysis: Analyzing temporal changes.
- Step 3 - Data Standardization: Unifying monetary units and properly formatting percentages.
- Special Handling for Korean: (1) Categorizing professional terminology, (2) standardizing numerical expressions according to Korean usage habits, and (3) gradually integrating contextual clues to resolve ambiguities from ellipsis references.
Journalist Persona Prompt:
- Design Motivation: Table summarization shares common characteristics with straight news articles—both pursue concise, objective, and fact-based clear expressions.
- Effect: Summaries generated with general prompts capture core information but lack contextual clarity and coherence; the journalist persona prompt guides the model to specify information sources, clearly define numerical constraints, and integrate contextual details.
- Practical Example: Generating a vague summary without the persona, and generating a structured summary akin to a news report with the journalist persona.

Theme-Explanation Structure¶

Theme Part: - Form: Noun phrase of the table title + citation/basis expression (e.g., Korean "…에 따르면" = "According to..."). - Function: Providing key contextual anchors to ensure numerical values are correctly interpreted. For example, the Theme "According to refugee status statistics by nationality" provides an explicit context for the subsequent "2,437 applications, of which only 147 were approved." - Necessity: Unlike text summarization, table cells themselves do not provide enough context; lacking a Theme leads to sentence ambiguity.

Explanation Part: - Form: Structured analysis predicate based on highlighted cells. - Content: Selecting enumeration, magnitude comparison, or trend analysis depending on data comparability. - Example: "The net fiscal cost increased by 9.435 trillion KRW compared to the previous year, totaling 61.301 trillion KRW"—here, trend analysis is used to present changes.

Loss & Training¶

Tabular-TX is entirely based on In-Context Learning (ICL) and does not require any fine-tuning.
It provides a few-shot of tabular summarization examples and detailed structured instructions.
This makes it highly practical in environments with limited annotated data and computing resources.

Key Experimental Results¶

Main Results¶

Model	Method	ROUGE-1	ROUGE-L	BLEU	Average
KoBART (124M)	Full Fine-Tuning	0.37	0.28	0.35	0.33
EXAONE 3.0 7.8B	ICL	0.21	0.14	0.01	0.12
EXAONE 3.0 7.8B	LoRA Fine-Tuning	0.27	0.21	0.05	0.17
EXAONE 3.0 7.8B	Tabular-TX	0.51	0.39	0.44	0.45
Llama-3-Korean-8B	ICL	0.33	0.25	0.27	0.28
Llama-3-Korean-8B	Tabular-TX	0.48	0.37	0.42	0.43

Tabular-TX significantly outperforms all fine-tuning methods without requiring any fine-tuning.

Ablation Study¶

Comparison Dimension	Setting A	Setting B	Difference	Explanation
Tabular-TX vs. Pure ICL	0.45	0.12	+275%	Core value of the TX structured approach
Tabular-TX vs. LoRA	0.45	0.17	+165%	Non-fine-tuned method outperforming supervised method
Tabular-TX vs. Full Fine-Tuning	0.45	0.33	+36%	7.8B ICL outperforming 124M full fine-tuning
EXAONE vs. Llama-3-Korean	0.45	0.43	+5%	TX is effective across different models
With Theme Part vs. Without	Clearer	Ambiguous	-	Qualitative analysis, no quantitative ablation
With Journalist Persona vs. Without	Structured	Vague	-	Qualitative analysis, same as above

Key Findings¶

The 275% Gain of Tabular-TX: Compared to pure ICL, Tabular-TX improves the average score of EXAONE from 0.12 to 0.45, showcasing an impressive performance gain. The core contributions stem from: (1) step-by-step CoT reasoning that mitigates compositional deficiencies, (2) the journalist persona that guides structured expression, and (3) structural matching between the TX format and reference answers.
Why LoRA Fine-Tuning Performs Poorly: According to the multiplicative joint scaling law, larger models require more fine-tuning data. EXAONE (7.8B) is approximately 63 times larger than KoBART (124M), meaning it requires 63 times more data to achieve equivalent fine-tuning benefits, which the current dataset (7,170 training samples) is far from sufficient to provide.
Logic Behind ICL Outperforming Fine-Tuning: Tabular-TX essentially encodes domain knowledge (how to analyze table data types and how to structure outputs) into prompts, leveraging the zero/few-shot capabilities of LLMs to bypass the fine-tuning bottleneck caused by insufficient data.
Model Generalizability: Tabular-TX achieves significant improvements across two different Korean LLMs, demonstrating that the method is model-agnostic.

Highlights & Insights¶

"Begin with the End in Mind" Design Philosophy: First defining the ideal output structure (the TX format), then backward-inducing the required reasoning steps to generate that structure. This top-down methodologies design is of immense reference value in prompt engineering.
Cross-disciplinary Fusion of Journalism and NLP: Introducing reporting standards (conciseness, objectivity, inverted pyramid structure) of news stories into LLM prompt design is an ingenious cross-domain adaptation.
ICL as a Low-resource Alternative: In actual scenarios where annotated data and computational resources are constrained, a well-crafted ICL pipeline can be far more effective than straightforward fine-tuning. This offers important insights for practical deployment.
Engineering Value of Merged Cell Handling: The seemingly simple preprocessing step (replicating merged cells across all covered positions) actually resolves a critical barrier to LLMs' understanding of tabular hierarchical relationships.

Limitations & Future Work¶

Evaluation Limited to Two Korean Models: Both EXAONE and Llama-3-Korean-Bllossom are 8B-scale Korean models. The performance on larger models (70B+) and English/Chinese models has not yet been validated.
Restricted to Korean Administrative Tables: Whether the TX structure is applicable to other languages (such as Chinese and English) and tables from other domains (such as medical and technology) requires further investigation.
Rigidity of the Predefined Structure: The current TX structure is fixed (Theme + Explanation), which might not be optimal for certain table types. Ideally, it should adaptively adjust based on table characteristics.
Lack of Fine-grained Ablation: Contributions of individual components such as CoT, journalist persona, and the TX structure were not ablated separately, making it difficult to determine which component is the most critical.
Directions for Improvement:
- Introducing more flexible adaptive sentence structure generation.
- Validating generalizability on multilingual and multi-domain tables.
- Adding quantitative ablation experiments.
- Comparing with recent baselines like Chain-of-Table and TableLlama on the same dataset.

Chain-of-Table (Wang et al., 2024): Simplifies reasoning by reordering, extracting, and filtering table data. It performs excellently in structural processing and mathematical reasoning but struggles to integrate metadata and background knowledge.
TableLlama (Zhang et al., 2024b): A general table model fine-tuned on 14 datasets, yet it is computationally expensive.
FeTaQA (Nan et al., 2022): A key reference benchmark for table explanation; the dataset used in Tabular-TX is similar but focuses on the Korean administrative domain.
Insight: Combining the tabular manipulation capabilities of Chain-of-Table with the structured output format of Tabular-TX could yield a more robust table summarization system.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of the TX output structure and the journalist persona is a novel prompt engineering approach.
Experimental Thoroughness: ⭐⭐⭐ The core comparison is sufficient, but it lacks detailed ablation studies and cross-lingual validation.
Writing Quality: ⭐⭐⭐⭐ The method is clearly described with abundant examples, but the paper is relatively short.
Value: ⭐⭐⭐⭐ High practical value for low-resource table summarization scenarios, with promising potential for generalizing the TX structure.