📖 NLP Understanding
💬 ACL2026 · 14 paper notes
📌 Same area in other venues: 🤖 AAAI2026 (2) · 🧠 NeurIPS2025 (2) · 📹 ICCV2025 (1)
🔥 Top topics: LLM ×3 · Information Extraction ×2 · Question Answering ×2
- A Computational Method for Measuring "Open Codes" in Qualitative Analysis
This paper proposes a theoretically grounded computational framework that employs an LLM-augmented code merging algorithm alongside four ground-truth-free metrics (Coverage, Overlap, Novelty, and Divergence) to systematically evaluate the performance of both human and AI coders in inductive qualitative coding.
- Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations
This paper extends the LiTEx reasoning taxonomy from "label-consistent, explanation-variant" settings to label-disagreement scenarios, finding that annotators may share similar reasoning strategies despite assigning different labels, and that reasoning category agreement better reflects the semantic similarity of explanations than label agreement alone.
- BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
BoundRL reframes structured text segmentation as a boundary generation task: the model generates only the start tokens of each segment rather than reproducing the complete text, which reduces output tokens by 90% and eliminates the risk of hallucinated segment content. Trained with RLVR using a dual-objective reward function and a selective perturbation strategy, a 1.7B model surpasses the few-shot performance of Claude-4 Sonnet.
- Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding
This paper proposes an automated method for augmenting existing commonsense knowledge bases with negation, constructing a large-scale negated commonsense corpus (¬Atomic and ¬Anion) containing over 2 million triples, and demonstrates that pretraining on this corpus improves LLMs' negation understanding capabilities.
- Creating ConLangs to Probe the Metalinguistic Grammatical Knowledge of LLMs
This paper introduces IASC (Interactive Agentic System for ConLangs), a modular system for building constructed languages that probes LLMs' metalinguistic knowledge by requiring them to perform morphosyntactic transformations according to linguistic specifications. The findings reveal that LLMs handle common typological patterns far better than rare ones, and that capability gaps between different LLMs are substantial.
- DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot NER
DiZiNER simulates the human pilot annotation workflow: multiple heterogeneous LLMs independently annotate the same text, and inter-model disagreements are analyzed to iteratively refine task instructions. The method achieves zero-shot SOTA on 14 out of 18 NER benchmarks, with an average F1 gain of +8.0, surpassing its supervisor model GPT-5 mini.
- HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction
This paper proposes HCRE, a model that reformulates cross-document relation extraction from direct classification over a large relation set into layer-wise hierarchical classification guided by a constructed relation tree. A predict-then-verify inference strategy is designed to mitigate inter-layer error propagation. HCRE achieves substantial improvements over both SLM and LLM baselines on the CodRED benchmark.
- It's High Time: A Survey of Temporal Question Answering
This paper presents a comprehensive survey of Temporal Question Answering (TQA). It proposes a unified analytical framework along three dimensions (corpus temporality, question temporality, and model temporal capability), systematically reviews how TQA methods, benchmark datasets, and evaluation strategies have evolved from rule-based pipelines to the Transformer/LLM era, and identifies key challenges for future research.
- LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases
This work introduces the first structured taxonomy of legal relations in Chinese civil law (9 domains, 265 relation types) and presents LexRel, a benchmark of 1,140 expert-annotated instances. Evaluating leading LLMs on LexRel reveals significant limitations in legal relation extraction, and the authors further show that incorporating legal relation information yields consistent gains on downstream legal AI tasks.
- LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines
This paper proposes an LLM-guided semantic bootstrapping framework that leverages LLMs to generate sub-intents and trains a Non-Negated Tsetlin Machine (NTM) on synthetic data generated through a three-stage curriculum. High-confidence symbolic features extracted by the NTM are then injected into the representations of real data, enabling a standard TM to approach BERT-level classification performance while remaining fully interpretable.
- Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models
This paper systematically investigates the sensitivity of large language models to the ordering of prompt components in multiple-choice question answering (MCQA). Through controlled experiments, the authors rule out training bias and memory decay hypotheses, identifying the causal attention mask as the fundamental mechanism responsible for the substantial performance degradation observed under the QOC (Question–Options–Context) ordering.
- MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification
This paper introduces MADE, a "living" multi-label text classification benchmark built on FDA medical device adverse event reports, with 1,154 hierarchical labels and strict temporal splits. It systematically evaluates 20+ encoder and decoder models under three paradigms (discriminative fine-tuning, generative fine-tuning, and few-shot prompting), assessing both predictive performance and uncertainty quantification (UQ). The results reveal critical trade-offs: small discriminatively fine-tuned decoders achieve the best accuracy across both head and tail labels; generative fine-tuning yields the most reliable UQ; and large reasoning models improve rare-label performance but exhibit surprisingly weak UQ.
- Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
This paper proposes a reasoning-based cluster refinement framework that uses LLMs as semantic judges (rather than as embedding generators) to verify and restructure unsupervised clustering outputs through coherence verification, redundancy adjudication, and label grounding. On social media corpora, it significantly improves cluster consistency and the quality of human-aligned annotations.
- Table Question Answering in the Era of Large Language Models: A Comprehensive Survey
This paper presents a comprehensive survey of Table Question Answering (TQA) research in the era of large language models. It systematically categorizes task settings along five dimensions (table format, question complexity, answer format, modality, and domain), organizes modeling approaches around five core challenges (table understanding, complex queries, large input handling, data heterogeneity, and knowledge integration), covers 277 papers, and provides forward-looking discussions on emerging directions such as reinforcement learning and interpretability.
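
The BoundRL entry describes generating only each segment's start tokens instead of the full text. A minimal sketch of how such start strings could be mapped back onto the source to recover complete segments, assuming each predicted start string occurs verbatim in the input and segments appear in order. The helper name `segments_from_starts` and this output format are hypothetical illustrations, not BoundRL's actual code or output schema.

```python
def segments_from_starts(text: str, starts: list[str]) -> list[str]:
    """Slice `text` into segments given the predicted start string of each one.

    Hypothetical helper: assumes every start string occurs verbatim in `text`
    and that segments appear in the same order as `starts`. Any text before
    the first start string is not assigned to a segment.
    """
    offsets = []
    cursor = 0
    for start in starts:
        if not start:
            raise ValueError("empty start string")
        idx = text.find(start, cursor)
        if idx == -1:
            # A start string absent from the source is rejected, not accepted.
            raise ValueError(f"start string not found in source: {start!r}")
        offsets.append(idx)
        cursor = idx + len(start)  # the next segment must begin after this start string
    offsets.append(len(text))
    # Every segment is a slice of the source, so no segment text is generated.
    return [text[a:b] for a, b in zip(offsets, offsets[1:])]


doc = "Intro. We study X. Methods. We train Y. Results. Z improves."
print(segments_from_starts(doc, ["Intro.", "Methods.", "Results."]))
# ['Intro. We study X. ', 'Methods. We train Y. ', 'Results. Z improves.']
```

Because segments are sliced from the input rather than regenerated, the output length in this sketch grows with the number of boundaries instead of the document length, and segment text cannot drift from the source.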
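
The "Lost in the Prompt Order" entry attributes the QOC degradation to the causal attention mask. A toy illustration of why ordering matters under such a mask, assuming a standard decoder-only mask in which each position attends only to itself and earlier positions: under QOC the option tokens come before the context, so none of them can attend to any context token. The segment lengths and function names are arbitrary assumptions; this checks mask reachability only and does not reproduce the paper's experiments.

```python
def causal_mask(n: int) -> list[list[bool]]:
    # mask[i][j] is True when query position i may attend to key position j,
    # i.e. only to itself and earlier positions (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]


def layout(order: list[str], lengths: dict[str, int]) -> tuple[dict[str, range], int]:
    # Assign consecutive token positions to each prompt segment in the given order.
    spans, pos = {}, 0
    for name in order:
        spans[name] = range(pos, pos + lengths[name])
        pos += lengths[name]
    return spans, pos


def can_attend(order: list[str], lengths: dict[str, int], src: str, dst: str) -> bool:
    # True if any token of segment `src` may attend to any token of segment `dst`.
    spans, total = layout(order, lengths)
    mask = causal_mask(total)
    return any(mask[i][j] for i in spans[src] for j in spans[dst])


lengths = {"question": 3, "options": 4, "context": 5}  # arbitrary toy sizes
for order in (["context", "question", "options"], ["question", "options", "context"]):
    print("".join(s[0].upper() for s in order), can_attend(order, lengths, "options", "context"))
# CQO True
# QOC False
```

Under QOC the dependency runs only one way: context tokens can still attend to the options, but the option representations are computed without any access to the context.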