
📖 NLP Understanding

💬 ACL2025 · 30 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (2) · 💬 ACL2026 (34) · 🧪 ICML2026 (2) · 🤖 AAAI2026 (1) · 🧠 NeurIPS2025 (3) · 📹 ICCV2025 (1)

🔥 Top topics: Question Answering ×16 · Reasoning ×5 · LLM ×5 · Information Extraction ×4 · Sentiment Analysis ×4

A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment: This paper proposes the GraphMPA framework, which achieves global document understanding by constructing a hierarchical document graph based on general similarity metrics, and introduces mode-seeking preference optimization to replace traditional DPO for more precise human preference alignment, comprehensively outperforming existing RAG methods across six QA datasets.
A Variational Approach for Mitigating Entity Bias in Relation Extraction: Proposes an entity debiasing method based on Variational Information Bottleneck (VIB) that maps entity tokens to Gaussian distributions to selectively compress entity-specific information while preserving contextual semantics. This achieves SOTA performance across relation extraction datasets in generic, financial, and biomedical domains, particularly showing a notable improvement of 5.3 F1 points on BioRED in OOD scenarios.
Active LLMs for Multi-hop Question Answering: This paper proposes an active large language model framework that enables the LLM to actively decide when external information retrieval is required and when direct reasoning can be performed, thereby achieving a more efficient and accurate reasoning process in multi-hop question answering tasks.
Adapting Psycholinguistic Research for LLMs: Gender-Inclusive Language in a Coreference Context: This work adapts the psycholinguistic experiment of Tibblin et al. (2023) from French to English and German and applies it to LLMs, measuring coreferent word probabilities and analyzing generated content. English LLMs generally keep gender consistent between antecedent and coreferent, but singular "they" is rarely used and an underlying masculine bias persists. The German Leo Mistral 7B model exhibits a stronger masculine bias that dominates across all 8 gender-inclusive strategies; nevertheless, these strategies still increase the probability of feminine and neutral forms, in line with the results of human psycholinguistic experiments.
Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification: Proposes a political bias analysis framework for LLMs based on Target-Oriented Sentiment Classification (TSC). By substituting the names of 1,319 politicians into 450 political sentences and predicting sentiments using 7 models across 6 languages, this study defines an entropy-based inconsistency metric to quantify bias. The findings reveal that LLMs exhibit a positive bias toward left-wing and centrist politicians and a negative bias toward the far-right, with larger models demonstrating stronger and more consistent biases.
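The note above does not give the paper's exact metric definition. As a rough illustration only, one way an entropy-based inconsistency score could work is to take the Shannon entropy of the sentiment labels predicted for a single sentence as the substituted politician name varies. The function name `label_entropy` and the example predictions are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    # p * log2(1/p) summed over the observed labels
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical predictions for one sentence, one label per substituted name.
consistent = ["negative", "negative", "negative", "negative"]
mixed = ["positive", "negative", "positive", "negative"]

print(label_entropy(consistent))  # 0.0: the name never changes the prediction
print(label_entropy(mixed))       # 1.0: the prediction flips with the name
```

Under this assumed definition, a score of zero means the predicted sentiment is independent of the name, and higher values mean the name alone changes the prediction more often.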
Automatic Generation of Inference Making Questions for Reading Comprehension Assessments: A reading comprehension inference question taxonomy (pronominal bridging / text-connecting / gap-filling) is developed to automatically generate multiple-choice questions for specific inference types using GPT-4o few-shot prompting; while 93.8% of the questions are of acceptable quality, only 42.6% accurately match the target inference type, indicating LLMs still lack precise control over their reasoning abilities.
BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering: This paper proposes BELLE, a bi-level multi-agent debate framework. It first classifies multi-hop questions into four types, and then dynamically plans the optimal combination scheme of operators (such as CoT, single-step retrieval, and iterative retrieval) through a bi-level debate mechanism (a first-level affirmative-negative debate + a second-level fast/slow debater supervision), realizing adaptive multi-hop reasoning tailored to different question types.
BookCoref: Coreference Resolution at Book Scale: This work proposes BookCoref, the first book-scale coreference resolution benchmark. By employing an automatic annotation pipeline integrating character linking, LLM filtering, and window expansion, it generates high-quality silver annotation data across 50 full novels, with an average document length exceeding 200k tokens.
BQA: Body Language Question Answering Dataset for Video Large Language Models: Based on the BoLD dataset, BQA is constructed via a four-step semi-automatic pipeline. BQA is a body language emotion recognition multiple-choice QA benchmark containing 7,632 short videos. Evaluation reveals that the strongest VideoLLMs (GPT-4o/Gemini) achieve an accuracy of only about 60%, which is far below human performance (85%). Furthermore, it exposes the models' over-reliance on facial expressions and significant biases towards specific racial groups.
CaLMQA: Exploring Culturally Specific Long-Form Question Answering across 23 Languages: The first multilingual long-form question answering dataset, CaLMQA (51.7K questions, 23 languages), is constructed. Culturally specific questions are collected using a translation-free approach. The study reveals that the factuality of large language models (LLMs) on culturally specific questions (45-52%) is significantly lower than on culturally neutral questions (64-71%), with low-resource languages showing particularly poor performance.
Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?: Using Item Response Theory (IRT) to place 11 LLMs on the same ability scale as real students, this work finds that strong models, when not prompted to imitate a student, far outperform average students. While persona prompting to "act as a student of a certain grade" can alter performance, no single model-prompt combination reliably simulates an average student across all subjects and grades.
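The note does not say which IRT model the paper fits. As a minimal sketch, assuming the common two-parameter logistic (2PL) model, the probability that a test taker with ability `theta` answers an item correctly depends on the item's discrimination `a` and difficulty `b`. All parameter values below are made up for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta, item (a, b))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item of average difficulty (b = 0) and unit discrimination.
average_student = p_correct(theta=0.0, a=1.0, b=0.0)
strong_model = p_correct(theta=2.0, a=1.0, b=0.0)

print(average_student)                 # 0.5
print(strong_model > average_student)  # True
```

Because students and models share the same `theta` scale in such a model, a model "simulates" a student well only if its estimated ability lands near that student's.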
Disambiguate First, Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing: A modular approach of "disambiguate first, parse later" is proposed, which leverages LLMs to generate default interpretations and trains a specialized infilling model to complete missing ones, thereby transforming ambiguous natural language queries into multiple explicit interpretations before parsing them into SQL individually.
Dynamic Order Template Prediction for Generative Aspect-Based Sentiment Analysis: This paper proposes the Dynamic Order Template (DOT) method, which decomposes ABSA sentiment quadruple generation into two stages: first predicting the template size (the number of quadruples) and generating the initial template, then generating the specific sentiment quadruples from that dynamic template. This approach achieves SOTA performance across 9 ABSA datasets with a 7× reduction in inference time compared to MvP.
Beyond Prompting: An Efficient Embedding Framework for Open-Domain Question Answering: EmbQA proposes an embedding-level ODQA framework. It optimizes query representations using lightweight linear layers and unsupervised contrastive learning to achieve passage reranking. Furthermore, it introduces exploratory embeddings based on order statistics to expand candidate answer diversity, coupled with an entropy-based selection mechanism for automatic answer selection. EmbQA outperforms prompt-level methods like SuRe with significantly lower computational cost across four ODQA datasets.
End-to-End Dialog Neural Coreference Resolution: Balancing Efficiency and Accuracy in Large-Scale Systems: An end-to-end neural coreference resolution system is proposed. By combining contextual embeddings, a hierarchical attention mechanism, and optimization strategies (pruning/quantization), it achieves a balance between efficiency and accuracy, with SpanBERT reaching 87.3 F1 on benchmark datasets such as OntoNotes.
Towards a More Generalized Approach in Open Relation Extraction: This paper proposes the MixORE framework, which operates under a highly generalized Open Relation Extraction setting (where unlabeled data simultaneously contains both known and novel relations, without making any long-tail or pre-segmentation assumptions). By utilizing a Semantic Autoencoder to detect novel relations, combined with open-world semi-supervised joint learning, MixORE comprehensively outperforms state-of-the-art (SOTA) methods on FewRel, TACRED, and Re-TACRED.
Generating Diverse Training Samples for Relation Extraction with Large Language Models: This paper investigates how to use LLMs to generate high-quality and diverse training samples for relation extraction (RE). It proposes an in-context learning (ICL) based one-by-one generation strategy and a DPO-based diversity fine-tuning method. The generated training data effectively improves the performance of few-shot RE models.
In the LLM Era, Word Sense Induction Remains Unsolved: This paper systematically evaluates the Word Sense Induction (WSI) task in the LLM era. On a more rigorously controlled SemCor-derived evaluation set, it is found that all unsupervised methods, including LLM-based approaches, fail to outperform the simple "one sense per word" baseline. Meanwhile, a semi-supervised method combining Wiktionary outperforms the previous SOTA by 3.3%, indicating that WSI remains far from being solved.
iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering: iQUEST proposes an iterative sub-question guided framework that dynamically generates answerable sub-questions at each reasoning step to maintain reasoning direction. Combined with GNNs to aggregate semantic information from two-hop neighbors for "look-ahead" entity exploration, it achieves SOTA or near-SOTA performance on four benchmarks (CWQ, WebQSP, WebQuestions, and GrailQA) without the need to fine-tune the LLM.
Multi-Hop Reasoning for Question Answering with Hyperbolic Representations: By inserting a single Poincaré hyperbolic layer into a T5 encoder-decoder model, this work maps Euclidean embeddings to hyperbolic space for multi-hop reasoning with minimal model modifications. Across four datasets, the hyperbolic model consistently outperforms its Euclidean counterpart. The experiments also demonstrate the effectiveness of \(\delta\)-hyperbolicity-based curvature initialization and show that hyperbolic space is more advantageous on datasets with stronger hierarchical structure.
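To make "mapping Euclidean embeddings to hyperbolic space" concrete, here is a small sketch of the standard exponential map at the origin of the Poincaré ball with curvature \(-c\). Whether the paper's layer uses exactly this map is an assumption; the function name `expmap0` and the example vector are illustrative.

```python
import math

def expmap0(v, c=1.0, eps=1e-15):
    """Exponential map at the origin of the Poincare ball (curvature -c).

    Sends a Euclidean vector v to a point inside the ball of radius 1/sqrt(c),
    keeping its direction and squashing its length with tanh.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        return list(v)  # the origin maps to itself
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

euclidean = [3.0, 4.0]  # Euclidean norm 5
hyperbolic = expmap0(euclidean, c=1.0)
radius = math.sqrt(sum(x * x for x in hyperbolic))

print(radius < 1.0)  # True: the point lies strictly inside the unit ball
```

The curvature parameter `c` is the quantity that a \(\delta\)-hyperbolicity-based initialization would set; here it is simply fixed to 1.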
On Synthesizing Data for Context Attribution in Question Answering: This paper proposes SynQA, a synthetic data strategy based on the "given context sentences \(\rightarrow\) generate QA pairs" paradigm, designed to train small models for context attribution tasks (i.e., identifying supporting evidence sentences for QA system answers). SynQA significantly outperforms zero-shot inference and LLM ensemble methods across multiple QA tasks and cross-domain scenarios.
QQSUM: A Novel Task and Model of Quantitative Query-Focused Summarization for Review-based Product Question Answering: This paper proposes the QQSUM task and the QQSUM-RAG framework. By leveraging KP-oriented retrieval and clustering alongside a Next-KP-Generation training strategy, it generates Key Point summaries containing diverse opinions and their quantified popularity from product reviews, addressing the limitation of traditional PQA systems that only output single-perspective answers.
Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data: This paper proposes the ReQAP method, which constructs executable operator trees via recursive question decomposition to achieve complex question answering over heterogeneous (structured + unstructured) personal data, supporting lightweight on-device deployment.
ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision: This paper proposes ReSCORE, which leverages the joint probability of document-query relevance and document-answer consistency generated by an LLM as pseudo-labels to train a dense retriever in an unsupervised manner within an iterative RAG framework, achieving SOTA performance on three multi-hop QA datasets.
Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints: In response to the counterintuitive phenomenon where "directly inputting semantic parsing results into LLMs actually degrades performance," this paper proposes SENSE—a zero-shot method that embeds semantic hints (rather than explicit parsing results) in the prompt, consistently improving LLM performance across GLUE understanding tasks and generation tasks such as machine translation, paraphrasing, and simplification.
RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering: Proposes RISE, a multi-hop QA framework combining RAG with self-iterative training. Through a self-exploration loop of three actions (question decomposition, retrieve-and-read, and self-critique), it iteratively generates training data and optimizes the model with multiple objectives, outperforming GPT-3.5 and all 8B-tier baselines on 2Wiki, HotpotQA, and MuSiQue.
Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering: The SiGIR framework is proposed to equip models with iterative question decomposition, retrieval, reasoning, and self-evaluation capabilities through end-to-end training. During inference, it uses self-critique feedback to guide iteration-level beam search for selecting optimal reasoning paths, outperforming the state-of-the-art (SOTA) by an average of 8.6% across three multi-hop QA datasets.
Sentiment Reasoning for Healthcare: This work introduces a new task termed "Sentiment Reasoning," which requires models to generate explanatory rationales while predicting sentiment labels for healthcare conversations. A multimodal sentiment analysis dataset comprising 30K samples across five languages is constructed. Rationale-augmented training improves classification accuracy and macro-F1 by approximately 2%.
SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Analysis: This paper proposes the SynGraph framework, which categorizes sparse users on a continuous-time dynamic graph into three classes (mid-tail, long-tail, and extreme), and leverages LLMs to synthesize augmented data tailored to different sparsity levels (combining local-global graph understanding, high-order relationships, and profile generation) to effectively alleviate data sparsity issues in streaming review sentiment analysis.
A Variational Approach for Mitigating Entity Bias in Relation Extraction: Proposes applying Variational Information Bottleneck (VIB) to entity debiasing in relation extraction. Entities are mapped to a probability distribution \(\mathcal{N}(\mu, \sigma^2)\) to compress entity-specific information while retaining task-relevant features, and the variance \(\sigma^2\) quantifies the model's level of reliance on entities vs. context. It achieves SOTA in both ID and OOD settings on three datasets from the general, financial, and biomedical domains: TACRED, REFinD, and BioRED.
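The two ingredients of a VIB-style entity encoder described above can be sketched as follows: a reparameterized sample from a Gaussian with mean `mu` and variance `exp(log_var)`, and the KL penalty toward a standard normal that does the compression. This is a generic VIB sketch under stated assumptions, not the paper's implementation; the standard-normal prior, the function names, and the toy values are all hypothetical.

```python
import math
import random

def sample_entity(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu = [1.0, -2.0]          # toy entity mean
log_var = [0.0, 0.0]      # toy log-variance, i.e. sigma^2 = 1

z = sample_entity(mu, log_var, rng)           # noisy entity representation
penalty = kl_to_standard_normal(mu, log_var)  # compression term in the loss

print(len(z))    # 2
print(penalty)   # 2.5
```

In this reading, a larger learned variance means more of the entity-specific signal is replaced by noise, which matches the note's remark that \(\sigma^2\) indicates how much the model leans on entities versus context.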