
📖 NLP Understanding

💬 ACL2026 · 34 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (2) · 🧪 ICML2026 (2) · 🤖 AAAI2026 (1) · 🧠 NeurIPS2025 (3) · 📹 ICCV2025 (1) · 🧪 ICML2025 (1)

🔥 Top topics: LLM ×6 · Question Answering ×5 · Reasoning ×4 · Information Extraction ×4 · Sentiment Analysis ×2

A Computational Method for Measuring "Open Codes" in Qualitative Analysis: This paper proposes a theory-based computational method to systematically evaluate human and AI performance in inductive qualitative coding through an LLM-enhanced code merging algorithm and four ground-truth-free metrics (Coverage, Overlap, Novelty, and Divergence).
Accurate and Efficient Statistical Testing for Word Semantic Breadth: This paper identifies that directly comparing the semantic breadth of two words using permutation tests in contextual embedding space severely inflates Type-I errors due to differences in mean directions. It proposes using Householder reflections to align mean directions before permutation, reducing Type-I errors by 32.5%, and provides a GPU batch implementation achieving a 23x speedup.
AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models: This paper proposes AdapTime, which abstracts "temporal reasoning" into three reusable atomic actions: reformulate, rewrite, and review. Guided by an LLM Planner, the system adaptively decides which steps to execute and in what order based on the question and context. Without external tools, manual rules, or fine-tuning, it significantly improves LLM performance on temporal QA, pushing TimeQA-Easy to 85.4 EM on DeepSeek-V3.
Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations: The LiTEx reasoning taxonomy is extended from "explanation variation under label agreement" to "label disagreement" scenarios. It is found that annotators may have different labels but similar reasoning, and the consistency of reasoning categories reflects the semantic similarity of explanations better than label consistency.
ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering: ASTRA adaptively reconstructs complex tables into semantic trees and employs a dual-mode reasoning approach consisting of text tree navigation and symbolic code execution. It achieves accuracies of 91.6%, 81.9%, and 90.1% on AIT-QA, SSTQA, and HiTab, respectively, outperforming strong LLMs and existing table structuralization methods.
Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering: This paper leverages Rhetorical Structure Theory (RST) to parse the discourse organization of long documents, constructing a sentence-level hierarchical tree with intermediate nodes enhanced by LLM summarization. By performing structure-aware multi-granularity retrieval on this tree, the proposed method consistently outperforms fixed-size chunking and RAPTOR-style semantic clustering across four benchmarks: QASPER, QuALITY, NarrativeQA, and MultiFieldQA-zh.
BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation: BoundRL reframes structured text segmentation as a boundary generation task—generating only the starting tokens for each segment rather than the full text. This reduces output tokens by 90% and eliminates hallucination risks. Combined with a dual-objective reward function and a selective perturbation strategy for RLVR training, it enables a 1.7B small model to outperform the few-shot performance of Claude-4 Sonnet.
Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?: This paper constructs ReCo, a dataset of reading comprehension items annotated for cognitive complexity, and systematically evaluates whether 8 LLMs can automatically determine the evidence scope and transformation level each item requires. Results indicate that strong models approach expert performance but remain significantly below it, particularly in identifying complete evidence sets and fine-grained word-order transformations.
Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding: This paper proposes an automated method to augment existing commonsense knowledge bases with negation, constructs a negated commonsense corpus of over 2 million triplets (\(\neg \text{Atomic}\) and \(\neg \text{Anion}\)), and demonstrates that pretraining on this corpus enhances the negation understanding capabilities of LLMs.
Creating ConLangs to Probe the Metalinguistic Grammatical Knowledge of LLMs: This paper introduces IASC (Interactive Agentic System for ConLangs), a modular constructed language generation system. By requiring LLMs to execute morphosyntactic transformations based on linguistic specifications, the study probes their metalinguistic knowledge. Findings reveal that LLMs handle common linguistic typological patterns significantly better than rare ones, and performance varies drastically across different models.
DimABSA: Building Multilingual and Multidomain Datasets for Dimensional Aspect-Based Sentiment Analysis: The authors construct DimABSA, the first multilingual (6 languages) and multi-domain (4 domains) dimensional aspect-based sentiment analysis dataset (76,958 aspect instances / 42,590 sentences). It replaces the traditional three-way "positive/negative/neutral" classification with continuous valence–arousal scores, defines three new subtasks and a unified metric, cF1, and systematically evaluates 6 open- and closed-source LLMs.
DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition: DiZiNER achieves zero-shot SOTA on 14 out of 18 NER benchmarks by simulating the "pilot annotation" process from human labeling. It utilizes multiple heterogeneous LLMs as annotators and a supervisor LLM to analyze inter-model disagreements and iteratively refine task instructions, resulting in an average improvement of +8.0 F1 and outperforming its supervisor (GPT-5 mini).
EXCEEDS: Extracting Complex Events via Nugget-based Grid Modeling in Scientific Domain: The authors identify two major pain points in "scientific abstract" EE scenarios that are absent in legacy datasets: high information density (5.54 events + 12.82 arguments per 100 tokens) and complex event structures (overlapping/discontinuous/reverse-order nuggets + sub-events). Consequently, they (a) annotated the SciEvents dataset with 2,508 documents and 24,381 events, and (b) proposed EXCEEDS—an end-to-end framework that reformulates EE as "multi-label relation classification on an \(l \times l\) word-word grid." By utilizing three types of edges (HTL/THL/EAL) to unify the modeling of triggers, arguments, and sub-events, EXCEEDS outperforms 9 SOTA baselines in both main metrics and complex scenarios.
Exploring Concreteness Through a Figurative Lens: The authors decompose the internal representation of "concreteness" across four LLMs (Llama-3.1-8B / Qwen3-8B / Gemma2-9B / GPT-OSS-20B) using prompt-based probing, DiffMean, and SVD. They find that early layers already distinguish between literal (high-concreteness) and figurative (low-concreteness) noun usage, while mid-to-late layers compress concreteness information into a single direction. This one-dimensional axis achieves zero-shot figurative text classification performance nearly on par with supervised 4096-dimensional classifiers and can be added directly to hidden states to perform controllable "literal ↔ figurative" rewrites during generation.
Filling the Gap: Is Commonsense Knowledge Generation useful for Natural Language Inference?: The paper proposes a method where LLMs generate natural language "commonsense axioms" to bridge premises and hypotheses. A "factuality judge" filters unreliable axioms, and high-quality ones are injected back into the NLI prompt. Consequently, Llama-3.1-70B and gpt-oss-120b achieve accuracy gains of 1.99-6.88% on SNLI/ANLI and significantly mitigate the "Neutral" safety bias.
HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction: The HCRE model is proposed to transform cross-document relation extraction from direct classification over large relation sets into layer-by-layer hierarchical classification by constructing a hierarchical relation tree. A predict-then-verify inference strategy is designed to mitigate inter-layer error propagation, significantly outperforming SLM and LLM baselines on the CodRED dataset.
It's High Time: A Survey of Temporal Question Answering: This paper provides a comprehensive survey of Temporal Question Answering (TQA), proposing a unified analytical framework based on three dimensions: corpus temporality, question temporality, and model temporal capability. It systematically reviews the evolution of TQA methods from rule-based pipelines to the Transformer/LLM era, organizes benchmark datasets and evaluation strategies, and identifies future challenges.
Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation: KARITA decomposes "temporal drift" into three complementary signals: uncertainty, feature distance, and ontological term rarity. For each target sample hit by these signals, it backtracks and retrieves semantically similar source samples with ground truth. It then employs LLM + domain ontologies (MeSH / EuroVoc / CSO) to generate synonym rewrites for data augmentation. This approach migrates the source model to future periods in a purely data-driven manner, consistently outperforming strong baselines on long-span multi-label classification data across clinical, legal, and scientific domains.
LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases: The authors construct the first structured classification system for Chinese civil legal relations (9 domains, 265 relation types) and propose the LexRel benchmark (1,140 expert-annotated samples). They evaluate mainstream LLMs on legal relation extraction, identifying significant limitations in current models while demonstrating the performance gains that legal relation information provides to downstream legal AI tasks.
LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines: This paper proposes an LLM-guided semantic bootstrapping framework. By utilizing LLMs to generate sub-intents and three-stage curriculum synthetic data, the authors train a Non-Negated Tsetlin Machine (NTM) to extract high-confidence symbolic features. These features are injected into real data, allowing a standard TM to approach BERT-level classification performance while maintaining full interpretability.
Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models: This paper investigates the sensitivity of Large Language Models (LLMs) to the order of prompt components in multiple-choice questions (MCQA). Through systematic experiments, it excludes training bias and memory decay hypotheses, revealing that the causal attention mask is the fundamental mechanism leading to significant performance degradation in the QOC (Question-Options-Context) order.
MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification: This paper proposes MADE—a "living" multi-label text classification benchmark based on FDA medical device adverse event reports, containing 1,154 hierarchical labels and strict temporal splitting. It systematically evaluates the predictive performance and uncertainty quantification (UQ) capabilities of 20+ encoder/decoder models under discriminative fine-tuning, generative fine-tuning, and few-shot prompting, revealing a critical trade-off: small discriminatively fine-tuned decoders are optimal for head-to-tail accuracy, generative fine-tuning provides the most reliable UQ, while large reasoning models improve rare labels but show unexpectedly weak UQ.
MetFuse: Figurative Fusion between Metonymy and Metaphor: The authors propose a three-stage pipeline (candidate generation → MLM scoring/selection → LLM refinement) to rewrite literal sentences into three figurative variants: metonymic, metaphoric, and hybrid. They construct the first MetFuse dataset (1,000 quadruplets, 4,000 sentences) and empirically discover that "the presence of metaphorical verbs makes metonymic nouns in the same sentence more explicit," yielding consistent improvements when used for data augmentation across 8 metonymy/metaphor classification benchmarks.
MSMO-ABSA: Multi-Scale and Multi-Objective Optimization for Cross-Lingual Aspect-Based Sentiment Analysis: The MSMO framework is proposed for cross-lingual aspect-based sentiment analysis. It utilizes sentence-level Wasserstein adversarial training with code-switched data for language discriminator alignment and aspect-level bidirectional KL consistency training to align prediction distributions of aspects with the same sentiment. Complemented by multi-teacher knowledge distillation, it achieves new SOTA results across four target languages in SemEval-2016 using mBERT/XLM-R, significantly outperforming LLM solutions such as GPT-4o and Qwen2.5-7B-LoRA.
MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training: MTSQL-R1 transforms multi-turn Text-to-SQL from "one-shot translation" into a long-horizon agent training problem that interacts with databases and dialogue memory. Through self-teaching warm-start SFT and multi-level GRPO rewards, small-scale Qwen3 models outperform strong closed-source prompting baselines and short-horizon SFT/RL baselines on CoSQL and SParC.
Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs: A reasoning-based cluster refinement framework is proposed, positioning LLMs as semantic judges (rather than embedding generators) to verify and restructure unsupervised clustering outputs. Through three reasoning stages—consistency verification, redundancy adjudication, and label grounding—this framework significantly improves cluster coherence and human-aligned labeling quality on social media corpora.
Refining and Reusing Annotation Guidelines for LLM Annotation: This paper transfers the guideline reuse and moderation processes of traditional manual annotation projects to LLM annotation. It demonstrates that explicit annotation guidelines, reasoning-based models, and iterative guideline refinement driven by a small number of gold-label discrepancies can significantly improve strict span+type F1 in biomedical NER.
SAM-NER: Semantic Archetype Mediation for Zero-Shot Named Entity Recognition: SAM-NER utilizes a three-stage mediation framework consisting of "Entity Discovery → 14 Universal Semantic Archetypes → Target Type Definition Calibration" to alleviate schema drift in zero-shot NER, achieving an average micro-F1 of 66.3 on CrossNER, surpassing several strong baselines.
Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling: This paper proposes RiSE, an inference-time semantic reranking framework that automatically identifies low-confidence hard examples and utilizes label semantic representations obtained from contrastive learning to rerank model outputs. It achieves an average gain of +9.15 macro-F1 on hard examples across eight Rhetorical Role Labeling datasets.
Table Question Answering in the Era of Large Language Models: A Comprehensive Survey: This paper provides a comprehensive survey of Table Question Answering (TQA) research in the LLM era. It systematically categorizes task settings across five dimensions (table format, question complexity, answer format, modality, and domain) and organizes modeling approaches based on core challenges (table understanding, complex queries, large inputs, data heterogeneity, and knowledge integration). Covering 277 papers, it also provides forward-looking discussions on emerging directions such as reinforcement learning and interpretability.
Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers: This paper systematically compares 12 reasoning LLMs on full multiple-choice questions (MCQs) versus choices-only MCQs. It finds that test-time reasoning (TTR) indeed allows models to perform above chance in choices-only scenarios. However, reasoning traces reveal that this is not entirely shallow cheating but includes "strategic test-taking" behaviors such as inferring missing questions, eliminating incorrect options, and invoking factual knowledge.
The Imperfective Paradox in Large Language Models: This paper evaluates whether LLMs understand that "doing something" does not necessarily imply "having finished something" using the newly constructed ImperfectiveNLI diagnostic set. It finds that open-source LLMs generally misjudge telic events as completed; prompt engineering merely oscillates between reducing completion hallucinations and preserving legitimate entailments, suggesting the core issue is the dominance of teleological priors during the reasoning phase.
TruthSplit: Operationalizing Conditional Validity in Arguments Through Multi-Perspective Reasoning: TruthSplit is an interactive argument analysis system that formalizes the phenomenon where "the same argument leads to different conclusions under different worldviews" as conditional validity. It decomposes text into claims, premises, and assumptions, employs a three-layer NLI check for logic and intra-worldview consistency, and utilizes six structured worldview personas to conditionalize LLM reasoning. The system generates interpretations and visualizes sources of divergence for each stance—not by assigning "right/wrong" labels, but by revealing whether disagreements stem from value prioritizations or conceptual definitions.
Revealing Temporal Framing in News Text: This paper proposes the concept of "temporal framing" in news text. Drawing from social science theories, it establishes a taxonomy consisting of 8 categories of temporal frames, annotates a bilingual English-German news corpus, and trains models for temporal frame detection using both supervised and zero-shot approaches.
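
💡 Worked sketch for the "Accurate and Efficient Statistical Testing for Word Semantic Breadth" entry above: the summary says Householder reflections are used to align the mean directions of two words' contextual embeddings before running a permutation test. The snippet below illustrates only that alignment step under stated assumptions. It is not the authors' implementation: the function names `unit_mean_direction` and `householder_align` are hypothetical, embeddings are assumed to be rows of a NumPy array, and the breadth statistic and permutation loop are left out because the summary does not specify them.

```python
import numpy as np


def unit_mean_direction(X):
    """Return the unit vector pointing along the mean of the rows of X."""
    m = X.mean(axis=0)
    return m / np.linalg.norm(m)


def householder_align(X, Y):
    """Reflect the rows of Y so that their mean direction matches that of X.

    Uses the Householder reflection H = I - 2 v v^T / (v^T v) with
    v = u_y - u_x, which maps the unit vector u_y onto u_x.
    H is orthogonal, so norms and pairwise distances within Y are preserved.
    """
    u_x = unit_mean_direction(X)
    u_y = unit_mean_direction(Y)
    v = u_y - u_x
    norm_sq = v @ v
    if norm_sq < 1e-12:
        # Mean directions already coincide; nothing to reflect.
        return Y.copy()
    # Apply H to every row without forming the d x d matrix explicitly.
    return Y - 2.0 * np.outer(Y @ v, v) / norm_sq


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 8)) + 3.0          # toy "embeddings" for word A
    Y = rng.normal(size=(40, 8)) - np.arange(8)  # toy "embeddings" for word B
    Y_aligned = householder_align(X, Y)
    # Cosine between the two mean directions after alignment (should be ~1).
    print(float(unit_mean_direction(X) @ unit_mean_direction(Y_aligned)))
```

Because the reflection is orthogonal, the spread of each word's embeddings around its own mean is unchanged. Only the difference in mean direction is removed, which is the effect the entry attributes the inflated Type-I errors to.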