✍️ Text Generation

💬 ACL2025 · 27 paper notes

🔥 Top topics: Summarization ×8 · LLM ×4 · Personalized Generation ×2

A Representation Level Analysis of NMT Model Robustness to Grammatical Errors: A systematic representation-level analysis of how NMT encoders process grammatical errors reveals that encoders first "detect" errors in shallow layers (indicated by rising GED probing \(F1\)), and then "correct" them in deep layers (indicated by falling CKA distance). It proposes the concept of "Robustness Heads" to identify the specific attention heads involved in error correction, validating this two-stage "detection-then-correction" mechanism across 4 models \(\times\) 5 language directions.
Abstractive Snippet Generation: This paper proposes an abstractive snippet generation method for search engines. Using query-aware summarization, it generates snippets for search result pages that are more concise and informative than traditional extractive snippets, significantly improving the user search experience.
An Empirical Study of Many-to-Many Summarization with Large Language Models: This work presents the first systematic study of Large Language Model (LLM) performance on the Many-to-Many Summarization (M2MS) task. By integrating 8 datasets, the authors construct a benchmark containing 47.8K samples across 5 domains and 6 languages. Evaluating 18 LLMs reveals that zero-shot LLMs perform comparably to fine-tuned traditional models, and significantly outperform them after instruction tuning. However, factual consistency remains a critical bottleneck.
ATGen: A Framework for Active Text Generation: The authors propose ATGen, the first systematic active learning (AL) framework for NLG. It integrates state-of-the-art (SOTA) AL strategies, human/LLM annotation interfaces, parameter-efficient fine-tuning (PEFT), and vLLM inference optimization. Evaluation on four NLG tasks (including TriviaQA and GSM8K) demonstrates that active learning can reduce annotation costs by a factor of 2 to 4.
Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation: This paper proposes a systematic evaluation framework based on Context-Preserving Prefix Trees (CP-Trie) to evaluate the intrinsic adaptability of truncation sampling methods between diversity and risk using probability-free and tuning-free metrics, providing practical guidance for parameter selection in real-world applications.
CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation: Proposes CoCoLex, a training-free decoding strategy that constructs a copy distribution using the Euclidean distance between decoding hidden states and context token hidden states. By using a prediction entropy-based confidence score to dynamically balance the ratio of "copying from context" and "free generation", it consistently improves faithfulness and correctness across five legal benchmarks, showcasing particularly outstanding performance in long-text generation tasks.
Context-Aware Hierarchical Merging for Long Document Summarization: This work proposes Context-Aware Hierarchical Merging (CAHM), which effectively mitigates LLM hallucinations during ultra-long document (>100K tokens) summarization by incorporating relevant source document context (via extractive, retrieval, or citation methods) into the hierarchical merging process.
Decomposed Opinion Summarization with Verified Aspect-Aware Modules: This study decomposes the opinion summarization task into three progressively verifiable modules—Aspect Identification, Opinion Consolidation, and Meta-Review Synthesis. By using zero-shot prompting on LLMs, a domain-independent modular processing pipeline is achieved, generating more traceable and comprehensive summaries across three domains: peer reviews, business reviews, and product reviews.
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems: Through a literature review and crowdsourcing study, this work systematically compiles 21 categories of interventions to mitigate anthropomorphism in text generation system outputs. It proposes a four-dimensional conceptual framework encompassing intervention type, target behavior, operationalization, and negative impact, providing the most comprehensive infrastructure for deanthropomorphization research.
Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport: Proposes MBR-OT, which introduces Optimal Transport (Wasserstein distance) into Minimum Bayes Risk (MBR) decoding to evaluate document-level output quality using sentence-level utility functions. It significantly outperforms standard MBR decoding on document-level machine translation, text simplification, and dense image captioning tasks.
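
For orientation, the sketch below shows standard sample-based MBR decoding, the baseline that MBR-OT extends. It is not the optimal-transport variant itself. The `unigram_f1` utility and the sample sentences are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy utility (an assumption): F1 over overlapping whitespace tokens."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    overlap = sum((hyp & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates: list[str]) -> str:
    """Return the candidate with the highest average utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(unigram_f1(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical samples drawn from a generation model.
samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the dog ran away",
]
print(mbr_decode(samples))
```

The outlier about the dog is never selected, because it overlaps least with the other samples. Per the summary above, MBR-OT would replace the whole-output utility with a sentence-level utility aggregated through optimal transport.
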
Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries: Proposes decoupling multi-document summarization into a three-step pipeline: key point extraction \(\to\) DPP diversity selection \(\to\) rewriting. By using Determinantal Point Processes (DPP) for principled content selection, the method significantly improves the source document coverage of LLMs in multi-document summarization.
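
The DPP selection step can be illustrated with a greedy MAP approximation, a common way to pick a diverse subset from a DPP kernel. This is a minimal sketch under stated assumptions: the kernel values are made up, and the source does not say which inference procedure the paper uses.

```python
def determinant(matrix: list[list[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_dpp(kernel: list[list[float]], k: int) -> list[int]:
    """Greedily add the item that maximizes det(kernel restricted to the subset)."""
    selected: list[int] = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            subset = selected + [i]
            sub = [[kernel[r][c] for c in subset] for r in subset]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no remaining item adds volume
            break
        selected.append(best)
    return selected


# Hypothetical kernel over three key points: 0 and 1 are near duplicates,
# 2 is distinct. Diagonal entries act as quality, off-diagonals as similarity.
kernel = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 0.90],
]
print(greedy_dpp(kernel, 2))  # → [0, 2]
```

The near-duplicate key point 1 is skipped even though its quality is higher than that of key point 2, which is the diversity behavior the pipeline relies on.
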
DTCRS: Dynamic Tree Construction for Recursive Summarization: DTCRS is proposed to dynamically construct summary trees based on document structure and query semantics. By using question decomposition and sub-question guided clustering, it reduces redundant summary nodes and significantly outperforms the static summary tree method RAPTOR on three QA datasets.
Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study: This paper proposes a language-neutral text editing approach (SWEET) that does not rely on language-specific edit sets. By introducing data-driven automated extraction and compression strategies for edit tags, this work successfully applies the text editing paradigm to Arabic Grammatical Error Correction (GEC) for the first time, achieving state-of-the-art performance across multiple benchmarks while increasing inference speed by over 6x.
Multi-document Summarization through Multi-document Event Relation Graph Reasoning in LLMs: Constructs a multi-document event relation graph (containing four types of intra-document event relations, cross-document event coreferences, and event-level moral foundations) and injects its bias information into LLMs via two strategies: graph serialization and graph prompt tuning. The resulting neutralized summaries outperform baselines in both content preservation and bias mitigation.
gec-metrics: A Unified Library for Grammatical Error Correction Evaluation: This paper proposes gec-metrics, a unified library that integrates 10 grammatical error correction (GEC) evaluation metrics into a single interface. It also provides meta-evaluation functionalities, addressing the issues of fragmentation, non-reproducibility, and limited extensibility in existing GEC evaluation implementations.
IMPARA-GED: Grammatical Error Detection is Boosting Reference-free Grammatical Error Quality Estimator: IMPARA-GED adds a Grammatical Error Detection (GED) pre-training step before constructing IMPARA's quality estimator and removes the ineffective similarity estimator. The resulting reference-free GEC metric achieves the highest sentence-level correlation on SEEDA.
Odysseus Navigates the Sirens' Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation: Proposes Dynamic Focus Decoding (DFD), which identifies knowledge-intensive decoding steps by tracking inter-layer distribution discrepancies (KL divergence) in LLMs and adaptively adjusts the temperature: it lowers the temperature on knowledge-intensive steps to preserve factuality and raises it on other steps to promote diversity. This improves factuality and diversity simultaneously across seven datasets.
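
The temperature adjustment can be sketched as follows. This is an illustration only, not the paper's formula. It assumes that a larger KL divergence between an early layer's and the final layer's next-token distribution marks a knowledge-intensive step. The mapping from divergence to temperature and the `t_min`, `t_max`, and `scale` parameters are all hypothetical.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dynamic_temperature(
    early_logits: list[float],
    final_logits: list[float],
    t_min: float = 0.5,
    t_max: float = 1.5,
    scale: float = 1.0,
) -> float:
    """Map inter-layer KL divergence to a temperature in (t_min, t_max].

    Assumption: high divergence means a knowledge-intensive step, so the
    temperature moves toward t_min; low divergence moves it toward t_max.
    """
    divergence = max(0.0, kl_divergence(softmax(final_logits), softmax(early_logits)))
    knowledge_weight = 1.0 - math.exp(-scale * divergence)  # in [0, 1)
    return t_max - (t_max - t_min) * knowledge_weight


# Hypothetical logits over a 3-token vocabulary.
early = [0.0, 0.0, 0.0]   # early layer is undecided
final = [5.0, 0.0, 0.0]   # final layer is confident
temperature = dynamic_temperature(early, final)
next_token_probs = softmax(final, temperature)
```
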
Personalized Text Generation with Contrastive Activation Steering: StyleVector is proposed as a training-free framework for personalized text generation. It extracts a "style vector" by contrasting the hidden-layer activations of real user responses with those of style-free model generations. During inference, a simple linear activation intervention steers the LLM to generate text conforming to the user's writing style. It achieves an 8% relative improvement on the LaMP and LongLaMP benchmarks while reducing storage requirements to 1/1700 of PEFT methods.
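
A minimal sketch of the idea, assuming the style vector is the difference between mean activations and the intervention is a scaled addition to one hidden state. The function names, the `alpha` strength parameter, and the toy two-dimensional activations are hypothetical. A real setup would read activations from a chosen layer of the LLM.

```python
def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def style_vector(
    user_activations: list[list[float]],
    generic_activations: list[list[float]],
) -> list[float]:
    """Mean activation on user-written text minus mean on style-free generations."""
    mu_user = mean_vector(user_activations)
    mu_generic = mean_vector(generic_activations)
    return [u - g for u, g in zip(mu_user, mu_generic)]


def steer(hidden: list[float], vector: list[float], alpha: float = 1.0) -> list[float]:
    """Linear intervention: shift a hidden state along the style direction."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (hypothetical values).
user_acts = [[1.0, 2.0], [3.0, 4.0]]
generic_acts = [[0.0, 0.0], [2.0, 2.0]]

v = style_vector(user_acts, generic_acts)
print(v)                               # → [1.0, 2.0]
print(steer([0.5, 0.5], v, alpha=2.0))  # → [2.5, 4.5]
```

Only one vector per user needs to be stored, which is consistent with the storage saving relative to PEFT that the summary reports.
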
PerSphere: A Comprehensive Framework for Multi-Faceted Perspective Retrieval and Summarization: This paper proposes the PerSphere benchmark dataset and the MURS (Multi-faceted perspective retrieval and summarization) task, which aims to retrieve and comprehensively summarize multi-faceted perspectives on controversial issues from a document collection. It also proposes HierSphere, a hierarchical multi-agent summarization system, to alleviate challenges related to long contexts and perspective extraction.
Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?: This paper points out a fundamental discrepancy between current automatic evaluation and human evaluation in GEC regarding the aggregation pipeline "from sentence-level scores to system rankings." Specifically, human evaluation relies on sentence-level pairwise comparisons combined with the TrueSkill ranking algorithm, whereas automatic evaluation typically uses average absolute scores followed by sorting. By adopting TrueSkill aggregation for automatic evaluation to bridge this gap, this study substantially improves the correlation of most metrics with human evaluation on the SEEDA benchmark, even allowing BERT-level metrics to outperform GPT-4.
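
The two aggregation pipelines can disagree, as the toy example below shows. Simple pairwise win counting stands in for TrueSkill here, and the system names and scores are made up. It illustrates the gap between averaging and pairwise comparison rather than reproducing the paper's procedure.

```python
from itertools import combinations


def rank_by_mean(scores: dict[str, list[float]]) -> list[str]:
    """Conventional automatic evaluation: average sentence scores, then sort."""
    return sorted(scores, key=lambda s: sum(scores[s]) / len(scores[s]), reverse=True)


def rank_by_pairwise_wins(scores: dict[str, list[float]]) -> list[str]:
    """Human-style aggregation: compare systems sentence by sentence, count wins."""
    wins = {system: 0 for system in scores}
    for a, b in combinations(scores, 2):
        for score_a, score_b in zip(scores[a], scores[b]):
            if score_a > score_b:
                wins[a] += 1
            elif score_b > score_a:
                wins[b] += 1
    return sorted(scores, key=lambda s: wins[s], reverse=True)


# Hypothetical sentence-level metric scores for two systems on three sentences.
scores = {
    "sys_a": [0.9, 0.1, 0.1],  # one very high score, otherwise poor
    "sys_b": [0.3, 0.3, 0.3],  # consistently moderate
}

print(rank_by_mean(scores))           # → ['sys_a', 'sys_b']
print(rank_by_pairwise_wins(scores))  # → ['sys_b', 'sys_a']
```

A single outlier sentence lets `sys_a` win on the mean, while `sys_b` wins two of the three sentence-level comparisons.
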
TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks: This paper proposes TagRouter, which uses a small tag generator to compress open-domain text generation requests into a set of semantic tags, and then routes requests by analyzing the relative advantages of each candidate LLM based on these tags. This achieves a higher system acceptance rate than any single large model without retraining the router, while significantly reducing inference costs.
Tell, Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes: The authors propose the Retell method: leveraging small LMs to generate abstractive retellings of literary passages, converting "showing" sensory details in narratives into "telling" high-level concepts, and subsequently running LDA topic modeling on the retold texts. Under resource-constrained conditions, this approach significantly outperforms baselines using direct LDA and directly querying LMs for topic labels.
Theme-Explanation Structure for Table Summarization Using Large Language Models: The authors propose the Tabular-TX pipeline, which achieves deep table understanding via multi-step CoT reasoning, generates clear sentences using a journalist persona prompt, and structures the output into a Theme (adverbial theme) + Explanation (predicative explanation) format. On a Korean administrative table summarization benchmark, it achieves the best performance with a ROUGE-1 score of 0.51 without relying on fine-tuning, significantly outperforming fine-tuning and pure ICL methods.
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework: To address the trade-off among multiple metrics (coherence/diversity/perplexity) in open-ended text generation, this paper proposes three complementary multi-criteria evaluation methods: the Extended Bradley-Terry model (ordinal ranking), Union-Free Generic Depth (partial ordering allowing incomparability), and Q*Text (cardinal comprehensive evaluation metric). Validated on over 1.8 million generated texts across 6 LLMs, 59 decoding strategies, and 3 datasets, the results show that moderate hyperparameter configurations generally outperform extreme ones, and smaller models with appropriate decoding strategies can match the performance of larger models.
Unveiling Attractor Cycles in Large Language Models: A Dynamical Systems View of Successive Paraphrasing: Starting from dynamical systems theory, this paper reveals that during successive paraphrasing, the outputs of LLMs converge to stable 2-period attractor cycles instead of exploring a broad paraphrase space, uncovering inherent limitations in the generation capabilities of LLMs.
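
A 2-period attractor can be pictured as an iterated map that ends up alternating between two states. The sketch below detects such a cycle for a deterministic toy "paraphraser". The lookup table is hypothetical, and exact string matching is a simplification, since real LLM outputs would need a similarity measure.

```python
from typing import Callable, Optional


def find_cycle(
    step: Callable[[str], str],
    start: str,
    max_steps: int = 100,
) -> Optional[tuple[int, int]]:
    """Iterate `step` from `start`; return (transient_length, period) or None."""
    seen: dict[str, int] = {}
    current = start
    for t in range(max_steps):
        if current in seen:
            return seen[current], t - seen[current]
        seen[current] = t
        current = step(current)
    return None


# Hypothetical paraphraser: after the first rewrite it alternates between two forms.
toy_paraphrases = {
    "original text": "variant one",
    "variant one": "variant two",
    "variant two": "variant one",
}

print(find_cycle(toy_paraphrases.__getitem__, "original text"))  # → (1, 2)
```
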
What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations: This paper proposes the VISTA dataset consisting of 18,599 AI conference presentation videos paired with paper abstracts, and introduces a plan-based summarization framework that guides structured summary generation for scientific videos by generating intermediate question sequences, significantly improving factual consistency.
Writing Like the Best: Exemplar-Based Expository Text Generation: Defines a new task, "Exemplar-Based Expository Text Generation"—generating an expository text about a target topic given an exemplar text about a source topic. It proposes the Recurrent Plan-then-Adapt (RePA) framework, which recurrently processes paragraph-level imitation planning, retrieval-augmented adaptive generation, and a dual-memory mechanism. RePA significantly outperforms GPT-4 and o1 baselines across three datasets: Wikipedia, RoleEE, and USNews.