💬 ACL2025 Accepted Papers

1853 ACL2025 paper notes covering LLM (Other) (442), Multimodal VLM (111), LLM Evaluation (89), Multilingual & Translation (86), Information Retrieval & RAG (84), Alignment & RLHF (82), Model Compression (78), LLM Agent (55), and 47 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.

💡 LLM Reasoning (54)

An Efficient and Precise Training Data Construction Framework for Process-Supervised Reward Model in Mathematical Reasoning: This paper proposes the EpicPRM framework, which quantifies the contribution of each reasoning step through perplexity-based Monte Carlo estimation and utilizes adaptive binary search to efficiently locate the first incorrect step. It constructs Epic50k, a high-quality process-supervised dataset (with only 50k annotated steps), which trains a PRM that performs comparably to or even outperforms models trained on PRM800k.
Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework: This paper proposes Aristotle, a logical reasoning framework that fully integrates symbolic expressions and logical rules into every stage of the Decompose-Search-Resolve process. Utilizing three core components—a logical decomposer, a search router, and a resolver—it achieves logic-complete reasoning, outperforming SOTA on several logical reasoning benchmarks with an average improvement of 4.5% on GPT-4 and 5.4% on GPT-4o.
Beyond the Answer: Advancing Multi-Hop QA with Fine-Grained Graph Reasoning and Evaluation: To address the issues of opaque reasoning processes and coarse evaluation granularity in multi-hop question answering (Multi-hop QA), this paper proposes a fine-grained graph reasoning framework. By constructing a reasoning graph to explicitly model evidence chains, and introducing fine-grained evaluation metrics, the framework measures the quality of the reasoning process rather than solely focusing on the correctness of the final answer.
BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving: The BPP-Search algorithm is proposed, which integrates Beam Search, Process Reward Models (PRMs), and a Pairwise Preference mechanism into the Tree-of-Thought framework for automatic mathematical modeling in operations research, significantly outperforming CoT/SC/ToT baselines on datasets like StructuredOR with fewer reasoning steps.
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?: This paper introduces DeltaBench, the first benchmark dataset to systematically evaluate the quality of long CoT reasoning in o1-like models and the error detection capabilities of existing LLMs/PRMs. Through fine-grained human annotation of 1,236 samples, it reveals a sobering reality: o1-like models exhibit approximately 27% reasoning redundancy and 67.8% ineffective reflections, and even the strongest critic model, GPT-4-turbo-128k, achieves an F1 score of only 40.8%.
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective: This paper proposes the Chain-of-Reasoning (CoR) framework, which unifies three paradigms, Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR), into a single reasoning chain. Guided by a Progressive Paradigm Training (PPT) strategy, a 7B model (CoR-Math-7B) achieves a 41% accuracy improvement over GPT-4o on theorem proving under zero-shot settings, and outperforms reinforcement learning (RL) methods by 15% on the MATH benchmark.
ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations: ClozeMath proposes a fine-tuning strategy inspired by human cloze learning. By masking equations in mathematical solutions and training the model to predict them (a text-infilling objective) jointly with standard language modeling objectives, ClozeMath significantly outperforms the strong baseline Masked Thought on GSM8K and MATH. It also demonstrates superior generalization in test-time scaling and robustness evaluations.
Commonsense Abductive Reasoning using Knowledge from Multiple Sources: This paper proposes a commonsense abductive reasoning method that integrates multi-source knowledge (knowledge graphs, pre-trained language models, and rule bases). By jointly utilizing structured and unstructured knowledge to generate more accurate and explainable best explanations, the method achieves significant improvements on abductive reasoning benchmarks.
Complex Reasoning with Natural Language Contexts and Background Knowledge: This paper proposes a complex reasoning framework that integrates natural language contexts with structured background knowledge. By utilizing knowledge graph retrieval augmentation and context-aware reasoning chain generation, it significantly improves LLM performance on multi-step reasoning tasks that require external knowledge support.
CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis: This paper proposes CoT-based Synthesizer, a novel inference scaling strategy that uses CoT reasoning to analyze complementary information across multiple candidate responses and synthesize a superior final answer. Even when all candidate answers are incorrect, it can still synthesize the correct answer, achieving an 11.8% gain for Llama3-8B and a 10.3% gain for GPT-4o on MATH500.

Browse all 54 LLM Reasoning papers →
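
The EpicPRM entry above mentions using binary search to locate the first incorrect reasoning step. The general idea can be illustrated with a minimal sketch. It is not the paper's implementation: the `prefix_is_sound` oracle is hypothetical (in practice it would be something like a Monte Carlo rollout check), the sketch assumes soundness is monotone (once a prefix contains an error, every longer prefix does too), and the "adaptive" part of the paper's search is not modeled.

```python
from typing import Callable, Optional


def first_incorrect_step(
    num_steps: int,
    prefix_is_sound: Callable[[int], bool],
) -> Optional[int]:
    """Return the 1-based index of the first incorrect step, or None.

    prefix_is_sound(k) is a hypothetical oracle that reports whether the
    first k steps are still error-free. It is assumed to be monotone:
    True for all k below the first error, False from the first error on.
    """
    # No steps, or the full solution is sound: nothing to locate.
    if num_steps == 0 or prefix_is_sound(num_steps):
        return None

    # Invariant: the prefix of length lo is sound (the empty prefix
    # trivially is), and the prefix of length hi is not.
    lo, hi = 0, num_steps
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_is_sound(mid):
            lo = mid
        else:
            hi = mid
    # Step hi is the one whose addition turned a sound prefix unsound.
    return hi


if __name__ == "__main__":
    # Toy solution with 8 steps whose first error is at step 5.
    print(first_incorrect_step(8, lambda k: k < 5))
```

Under the monotonicity assumption, this needs on the order of log2(n) oracle calls rather than one call per step, which is the efficiency argument for a search of this kind.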

🦾 LLM Agent (55)

Agentic Knowledgeable Self-Awareness: This paper proposes KnowSelf, a data-driven approach that labels special tokens on the agent's self-exploration trajectories to identify different thinking situations (fast thinking, slow thinking, knowledgeable thinking). Through a two-stage training process (SFT + RPO), the agent model learns to autonomously judge when to invoke external knowledge, achieving optimal planning performance with minimal knowledge consumption cost.
Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools: Agentic Reasoning proposes a framework that integrates three agent tools—Web search, code execution, and knowledge-graph-based memory (Mind-Map)—into the LLM reasoning process. It improves the accuracy of DeepSeek-R1 on Humanity's Last Exam from 9.4% to 23.8% (+14.4%) and GPQA from 71.5% to 81.2%, approaching the performance level of OpenAI Deep Research.
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems: This paper proposes the Agentic Reward Modeling paradigm and its implementation, RewardAgent, which integrates traditional human preference-based reward models with verifiable correctness signals from factuality and instruction-following verification. It significantly enhances the reliability of reward models through a three-module architecture consisting of a Router, Verification Agents, and a Judger.
Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks: This paper is the first to systematically study adversarial attacks in realistic multi-agent LLM systems featuring bandwidth constraints, latency, and security mechanisms. It proposes attack methods based on Minimum-Cost Maximum-Flow (MCMF) topological optimization and Permutation-Invariant Evasion Loss (PIEL), achieving up to a 7-fold increase in success rate compared to traditional attacks across multiple LLM architectures.
An Empirical Study on LLM-based Agents for Automated Bug Fixing: This paper systematically analyzes the top six LLM-based bug-fixing systems on SWE-bench Verified, revealing the capabilities and future directions of current agent systems across three dimensions: overall fixing effectiveness, fault localization accuracy, and the utility of bug reproduction.
AndroidGen: Building an Android Language Agent under Data Scarcity: This paper proposes the AndroidGen framework, which enhances LLM capabilities for Android operations under conditions of high-quality training data scarcity using four modules: Experience Search (ExpSearch), Reflection Planning (ReflectPlan), Automatic Checking (AutoCheck), and Step-level Critic (StepCritic). It successfully trains open-source mobile agents without manual annotation by automatically generating trajectory data.
Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning: A fully automated thematic analysis (TA) pipeline based on multi-agent LLMs is proposed. Through division of labor among specialized roles and optional RLHF fine-tuning, the system achieves end-to-end theme extraction from clinical narratives, eliminating the need for manual coding and full-text review.
Bel Esprit: Multi-Agent Framework for Building AI Model Pipelines: Proposes Bel Esprit, a multi-agent conversational framework. Through a four-step collaboration of Mentalist (requirement clarification) $\rightarrow$ Builder (pipeline construction) $\rightarrow$ Inspector (validation) $\rightarrow$ Matchmaker (model mapping), it automatically transforms vague natural language requirements from users into multi-model AI pipeline graphs, achieving 25.2% EM and 37.0 GED (with GPT-4o Builder) on 441 pipeline test cases.
Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents: This work systematically evaluates the zero-shot in-context decision-making capabilities of LLMs in Dueling Bandits (preference-feedback reinforcement learning). It reveals that GPT-4 Turbo excels in weak regret but displays a gap in strong regret. Consequently, the LEAD (LLM with Enhanced Algorithmic Dueling) framework is proposed, which achieves both theoretical guarantees and robustness through adaptive, fine-grained integration of classical DB algorithms with LLM agents.
BookWorld: From Novels to Interactive Agent Societies for Story Creation: BookWorld is the first multi-agent social simulation system based on novels. It constructs interactive virtual worlds by extracting character data and worldview specifications from source books, allowing novel characters to act and interact autonomously to generate creative stories, outperforming previous story generation methods in 75.36% of pairwise comparisons.

Browse all 55 LLM Agent papers →

👥 Multi-Agent (8)

Beyond Frameworks: Unpacking Collaboration Strategies in Multi-Agent Systems: This paper systematically decomposes multi-agent collaboration into four dimensions (governance mode, participation control, interaction pattern, context management). Through extensive experiments on two context-dependent tasks, it demonstrates that the combination of centralized governance + instructor-controlled participation + ordered interaction + instructor summarization is optimal, reducing token consumption by up to 93% while maintaining or even improving accuracy.
CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games: This paper proposes the CoMet framework. By integrating a hypothesis-testing-based metaphor reasoner and a self-improving metaphor generator, CoMet enables LLM agents to utilize metaphors for covert communication and semantic evasion in multi-agent language games. It significantly enhances the strategic communication capabilities of agents in Undercover and Adversarial Taboo games (improving the win rate from 0.20 to 0.70).
CortexDebate: Debating Sparsely and Equally for Multi-Agent Debate: This paper proposes CortexDebate, a multi-agent debate method inspired by the mechanism of the human cerebral cortex. By constructing a sparse dynamic debate graph and an evaluation module (MDM) based on the McKinsey Trust Formula, it simultaneously addresses two core challenges of existing Multi-Agent Debate (MAD) methods: "excessively long input context" and "unequal debate caused by overconfidence."
DocAgent: A Multi-Agent System for Automated Code Documentation Generation: Proposes DocAgent, an automated code documentation generation system based on topological dependency sorting. Through a collaborative Reader-Searcher-Writer-Verifier workflow, it incrementally constructs context, significantly outperforming FIM and Chat baselines across completeness, helpfulness, and truthfulness.
GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning: GETReason is proposed as a hierarchical multi-agent framework that decomposes the context extraction of public event images into three sub-tasks: geospatial, temporal, and event. These tasks are collaboratively completed by specialized agents, achieving more accurate image context reasoning than existing methods.
Multi-Agent Collaboration via Cross-Team Orchestration: This paper proposes Cross-Team Orchestration (Croto), a scalable multi-team collaboration framework that organizes multiple independent agent teams for cross-team interaction, utilizing Hierarchy Partitioning and Greedy Aggregation mechanisms to fuse diverse solutions from various teams into superior results.
Preventing Rogue Agents Improves Multi-Agent Collaboration: A framework is proposed to detect "rogue agents" by monitoring agent uncertainty in real-time and to intervene accordingly. This framework achieves performance improvements of up to 17.4%, 2.5%, and 20% on the self-built WhoDunitEnv multi-agent collaboration environment, code generation tasks, and resource sustainability tasks, respectively.
Voting or Consensus? Decision-Making in Multi-Agent Debate: This work systematically compares 7 decision protocols (voting vs. consensus) in multi-agent debate (MAD). It is found that consensus protocols improve performance by 2.8% on knowledge tasks, while voting protocols improve performance by 13.2% on reasoning tasks. Two new methods, AAD and CI, are proposed to enhance answer diversity, yielding performance gains of 3.3% and 7.4%, respectively.
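
To make the voting-versus-consensus distinction in the last entry concrete, here is a minimal sketch of the two protocol families. It is an illustration under stated assumptions, not the paper's seven protocols: the function names, the tie-breaking rule, and the `threshold` parameter are all hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Optional[Hashable]:
    """Voting: pick the most frequent answer.

    Ties are broken in favor of the answer that appears first
    (an arbitrary choice made for this sketch).
    """
    if not answers:
        return None
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer
    return None  # unreachable for non-empty input


def consensus(
    answers: Sequence[Hashable], threshold: float = 1.0
) -> Optional[Hashable]:
    """Consensus: accept an answer only if its share reaches `threshold`.

    Returns None when no answer has enough support, which a debate
    loop could treat as "run another round".
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= threshold else None


if __name__ == "__main__":
    round_answers = ["A", "B", "A"]
    print(majority_vote(round_answers))  # a decision is always made
    print(consensus(round_answers))      # unanimity required by default
```

The design difference is that voting always terminates with a decision, while consensus can withhold one until agents agree.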

⚖️ Alignment & RLHF (82)

A Dual-Mind Framework for Strategic and Expressive Negotiation Agent: Inspired by the dual-process theory of human cognition, this paper proposes a Dual-Mind Negotiation Agent (DMNA) framework. It combines an intuitive module (fast strategic planning, trained based on MCTS+DPO) and a deliberative module (slow expression optimization, based on a multifaceted reflection mechanism) to achieve state-of-the-art performance on negotiation tasks.
AceCoder: Acing Coder RL via Automated Test-Case Synthesis: The study constructs AceCode-87K (87K coding problems + 1.38M automatically synthesized test cases) to train a code-specific Reward Model (the 7B model outperforms the 340B Nemotron). Best-of-N sampling improves Llama-3.1-8B by 8.9 points on average. Direct R1-style RL from a base model for only 80 steps improves HumanEval+ by 22.5%.
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models: This paper proposes AGD (Adversarial Game Defense), an LLM jailbreak defense method based on adversarial games. By dynamically adjusting the internal representations of the model to balance helpfulness and harmlessness, AGD significantly improves LLM safety through three stages: IQR anomaly detection, bi-level optimization game, and expert model sampling.
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic LLMs: This paper proposes the AgentAlign framework, which leverages abstract behavior chains as an intermediary to synthesize high-quality agent safety alignment data (both harmful and benign) in simulated environments. Through Supervised Fine-Tuning (SFT), AgentAlign improves the agent safety of three open-source model families by 35.8%–79.5% while maintaining or even enhancing their task capabilities.
AgentRM: Enhancing Agent Generalization with Reward Modeling: AgentRM is proposed, a generalizable reward model constructed via explicit, implicit, and LLM-as-judge approaches. It guides policy models using test-time search (Best-of-N / Beam Search), achieving an average improvement of 8.8 points across 9 agent tasks and outperforming the best generalist agent by 4.0 points.
Aligning to What? Limits to RLHF Based Alignment: Through systematic experiments, this paper finds that RLHF (including DPO, ORPO, RLOO, etc.) is fundamentally ineffective at reducing covert racial bias in LLMs. Furthermore, executing SFT prior to RLHF "solidifies" model biases, revealing the deep limitations of current alignment techniques when dealing with ambiguous goals such as bias elimination.
AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models: This paper proposes the AMoPO framework, which achieves dimension-aware adaptive weight allocation by modeling the generation space as a Gaussian distribution. It completes multi-objective preference alignment without relying on reward models or reference models, outperforming the state-of-the-art (SOTA) by 28.5% on the HelpSteer2 dataset, and validating scalability on 7B, 14B, and 32B models.
ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning: Refines the granularity of preference optimization in DPO from the response level to the sentence level. By dynamically computing adaptive reward weights for each sentence based on image-text similarity and textual perplexity, it achieves average improvements of 2.57/2.87/1.98 points on LLaVA-1.5-7B/13B and InstructBLIP-13B, respectively, while significantly reducing hallucination rates.
Atyaephyra at SemEval-2025 Task 4: Low-Rank Negative Preference Optimization: In the SemEval 2025 LLM Unlearning Shared Task, this paper combines Negative Preference Optimization (NPO) with Low-Rank Adaptation (LoRA). By leveraging the structural properties of LoRA, the authors acquire the original model distribution with zero additional overhead to compute KL divergence regularization, significantly stabilizing the unlearning process and outperforming the task baselines.
AutoMixAlign: Adaptive Data Mixing for Multi-Task Preference Optimization in LLMs: AutoMixAlign proposes a theory-driven data mixing method for multi-task preference optimization: it first trains specialist models for each task to establish optimal loss baselines, and then adaptively adjusts data mixing proportions via minimax optimization, prioritizing tasks with the largest excess loss (gap from the specialist). It achieves an average improvement of 9.42% in helpfulness/harmlessness/reasoning multi-task DPO.

Browse all 82 Alignment & RLHF papers →

🔒 LLM Safety (55)

A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models: This paper comprehensively revisits Membership Inference Attacks (MIA) in LLMs from a statistical perspective through thousands of experiments. It analyzes the inconsistency of MIA performance across six dimensions: data splitting methods, model size, domain characteristics, text features, embedding separability, and decoding dynamics. It reveals previously overlooked findings such as threshold generalization, the impact of text length/similarity, and emergent changes in the embedding layers.
AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection: Proposes AGrail, a lifelong learning LLM Agent guardrail framework. Through dual-LLM collaboration (Analyzer + Executor) and a memory module, it adaptively generates and optimizes safety check policies at test time, effectively defending against task-specific and systemic risks.
Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning: This paper proposes "In-Context Knowledge Unlearning" by introducing special unlearning tokens <<UNL>>...<</UNL>> to enable LLMs to selectively forget specific knowledge during inference based on context. It achieves a 95% unlearning accuracy on TOFU/AGE/RWKU while retaining 80% of irrelevant knowledge. In-depth internal analysis reveals that LLMs do not truly delete the knowledge but rather "pretend to forget" it at the final layer.
Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs: This paper challenges the prior conclusion that LLM hidden states can encode the truthfulness of facts. By constructing more realistic and challenging datasets (perplexity-guided negative sampling and QA-based LLM generation datasets), the authors find that prior methods exhibit limited generalization on data that more closely resembles real-world scenarios, providing a more rigorous benchmark and practical guidance for LLM factuality evaluation.
Bias in the Mirror: Are LLMs' Opinions Robust to Their Own Adversarial Attacks: This paper proposes a novel "self-debate" paradigm where two instances of the same LLM play the proponent and opponent to debate each other, attempting to persuade a neutral version of the model. This setup is used to evaluate the robustness of LLMs' intrinsic bias—specifically, whether the bias is easily swayed and whether the model is susceptible to being misled by its own adversarial arguments.
CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks: This paper proposes the CAVGAN framework, which utilizes generative adversarial networks to simultaneously learn jailbreak attacks (generator) and safety defense (discriminator) within the internal representation space of LLMs. This is the first work to unify attack and defense into a single framework for mutual enhancement, achieving an average attack success rate of 88.85% and an average defense success rate of 84.17%.
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models: Proposes Chinese SimpleQA, the first comprehensive Chinese factuality evaluation benchmark, containing 3,000 high-quality short Q&A pairs (covering 6 main domains and 99 sub-domains). After evaluating 41 LLMs, only o1-preview (63.8%) and Doubao-pro-32k (61.9%) passed. The study systematically reveals key insights such as "larger models perform better," "RAG narrows the gap," and "alignment lowers factuality."
CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP: This work proposes CLIPErase, a machine unlearning framework tailored for multimodal CLIP models. By synergistically integrating a Forgetting Module, a Retention Module, and a Consistency Module, it selectively removes specified vision-language associations while preserving the performance of the model on the retained data.
ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty: The ComparisonQA benchmark (283K paired questions) is constructed to achieve controlled comparisons by having high- and low-frequency entities share the same abstract question. Combining a two-stage evaluation method of accuracy and uncertainty, the study reveals that LLMs (including GPT-4o) exhibit extremely poor robustness to low-frequency knowledge.
Core: Robust Factual Precision with Informative Sub-Claim Identification: This paper proposes the Core framework, which achieves robust factual precision evaluation by identifying and filtering informative sub-claims, addressing the issue of inaccurate evaluation in existing methods caused by the dilution effect of uninformative claims.

Browse all 55 LLM Safety papers →

👻 Hallucination Detection (27)

Activation Steering Decoding: Mitigating Hallucination in Large Vision-Language Models through Bidirectional Hidden State Intervention: This paper proposes ASD (Activation Steering Decoding), a training-free, inference-time hallucination mitigation method. By identifying hallucination direction patterns within the intermediate hidden states of LVLMs, it leverages bidirectional steering and contrastive decoding to suppress hallucinated outputs while preserving the model's performance on general visual understanding tasks.
Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering: This paper proposes the NOVA framework, which filters knowledge-aligned, high-quality instruction data by measuring the LLM's familiarity with instructions via Internal Consistency Probing (ICP) and familiarity with target responses via Semantic Equivalence Identification (SEI). Fine-tuning LLaMA-3-8B with only 5% of selected data achieves an 8.6-point improvement on BioGEN and a 7.2-point improvement on FollowRAG, while preserving instruction-following capability.
Alleviating Hallucinations from Knowledge Misalignment in Large Language Models via Selective Abstention Learning: To address the hallucination issue in LLMs caused by knowledge misalignment (inconsistency between model parametric knowledge and reality), this paper proposes a Selective Abstention Learning method. This approach enables the model to actively refuse to answer when encountering questions outside its knowledge boundary instead of fabricating content, thereby reducing hallucinations.
Automated Explanation Generation and Hallucination Detection for Heritage Image Retrieval: This paper proposes a framework combining automated explanation generation and hallucination detection for cultural heritage image retrieval. It utilizes vision-language models to generate explainable text descriptions for retrieval results, while ensuring the factual accuracy of descriptions through a domain-knowledge-constrained hallucination detection mechanism, validating the effectiveness of the method on multiple cultural heritage datasets.
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models: This paper proposes the first joint cross-lingual and cross-modal hallucination detection benchmark, CCHall, covering 9 languages and 4 types of multimodal datasets. It systematically evaluates the hallucination performance of 6 mainstream MLLMs in joint scenarios, revealing that the F1 score of current models in this joint scenario is 10.9% lower than that of cross-modal alone, and 3.4% lower than that of cross-lingual alone. Additionally, two mitigation paths are proposed: multilingual prompting and external tool assistance.
Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge: This paper systematically explores the performance of two self-correction methods (CoVE and RARR) in correcting hallucinations in news summaries. By comparing three search engines, multiple retrieval settings, and prompting strategies, it is found that the combination of Bing search snippets and RARR (few-shot) yields the best performance, with G-Eval aligning closely with human evaluations.
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence: Proposes the VHD metric to quantify how sensitive the output of each attention head is to visual input. It finds that only a few attention heads are highly sensitive to visual information, and the model's over-reliance on language priors is a key factor causing hallucinations. Based on this, a training-free method, VHR, is designed to adaptively reinforce the contribution of vision-sensitive heads layer-by-layer ($\alpha=2$), reducing the CHAIR$_S$ of LLaVA-1.5 on CHAIR from 49.68 to 33.32, with almost no additional inference overhead.
DRAG: Distilling RAG for SLMs from LLMs to Transfer Knowledge and Mitigate Hallucination: DRAG proposes a framework to distill RAG capabilities from large language models (LLMs) to small language models (SLMs): utilizing an LLM (e.g., GPT-4o) to generate evidence and knowledge graph triples for a given question. After ranking and filtering, these are fed to SLMs (2B-9B) as structured contexts, boosting SLM performance on ARC-C by up to 27.7% without fine-tuning, while significantly mitigating hallucinations.
ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries: Proposes the Entity Tracing Framework (ETF), a hallucination detection framework that extracts code entities via static program analysis and verifies whether these entities are correctly described in the generated summaries using LLMs. Combined with the first-of-its-kind CodeSumEval dataset (~10K samples), it achieves a 73% F1 score in code summary hallucination detection.
FIHA: Autonomous Fine-grained Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs: This paper proposes FIHA, an automated, fine-grained hallucination evaluation framework that requires neither LLMs nor human annotations. By extracting entities, attributes, and relations from images and descriptions to generate Q&A pairs, and introducing Davidson Scene Graphs (DSG) to model inter-question dependencies, the authors construct the FIHA-v1 benchmark to comprehensively evaluate the hallucination levels of mainstream Large Vision-Language Models.

Browse all 27 Hallucination Detection papers →

📊 LLM Evaluation (89)

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates: A conformal risk control framework is proposed for granular word-level error detection and uncertainty calibration of CLIPScore. By generating a score distribution through simple attention mask sampling, this method provides formal risk control guarantees while remaining model-agnostic.
MisMatched: A Benchmark for Scientific Natural Language Inference: Introduces MisMatched—the first scientific NLI evaluation benchmark covering non-CS fields (Psychology, Engineering, Public Health), consisting of 2,700 human-annotated sentence pairs. The best SLM baseline (SciBERT) achieves a Macro F1 of only 78.17%, while the best LLM baseline (Phi-3) scores only 57.16%. It also proves that training with implicit relation sentence pairs can improve model performance.
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research: Proposes AbGen—the first benchmark to evaluate the ability of LLMs to design ablation studies (1,500 expert-annotated data points from 807 NLP papers). It reveals that the strongest LLM (DeepSeek-R1) falls behind human experts by 14.4%, and LLM-as-Judge scores are highly inconsistent with human evaluations.
Access Denied Inc: The First Benchmark Environment for Sensitivity Awareness: This work formally defines the concept of LLM "Sensitivity Awareness" (SA) for the first time: whether an LLM can decide whether to provide information based on Role-Based Access Control (RBAC) rules. The authors construct an automated evaluation benchmark, Access Denied Inc, and evaluate 7 mainstream LLMs, finding that even with highly structured data and minimalist rules, the best-performing model, Grok-2, still exhibits a leak rate of 18.28%.
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models: The board game Codenames is implemented as an LLM evaluation benchmark, where LLMs play the roles of both Spymaster (clue giver) and Field Operative (guesser) against a deterministic opponent across 13 experimental setups of varying difficulty. Among 14 evaluated models, the best-performing model (o3-mini) achieves a win rate of only 49%, revealing substantial limitations of LLMs in vocabulary association, strategic positioning, and error correction.
AD-LLM: Benchmarking Large Language Models for Anomaly Detection: This paper proposes the first LLM anomaly detection benchmark, AD-LLM, to systematically evaluate the capability of LLMs in three core tasks: zero-shot detection, data augmentation, and unsupervised model selection. It reveals that GPT-4o zero-shot detection outperforms traditional training-based methods on most datasets. Additionally, synthetic data benefits detectors utilizing flexible representation learning but harms models with fixed geometric assumptions. Finally, reasoning LLMs achieve near-optimal model selection, though their explanations lack explicit dataset specificity.
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents: AndroidLab is proposed as a systematic Android agent evaluation and training framework, consisting of a unified operating environment, a reproducible benchmark with 138 tasks, and an instruction dataset of 94.3K steps. Through fine-tuning, the success rate of open-source LLMs is improved from 4.59% to 21.50%.
AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge: Proposes AntiLeakBench, an automated anti-leakage benchmark framework that identifies knowledge added after LLM cutoff dates by tracking Wikidata knowledge update histories, and automatically constructs single-/multi-hop QA test samples (with real-world Wikipedia supporting documents) to ensure strict knowledge-level zero contamination. Large-scale experiments on 12 LLMs demonstrate a pervasive post-cutoff performance decline (with a significant EM drop), validating the framework's effectiveness.
Are Bias Evaluation Methods Biased?: Under strictly controlled variables, this study compares three mainstream bias evaluation methods (structured Q&A BBQ, LLM-as-a-Judge, and sentiment analysis) and finds that different methods yield significantly different bias rankings for the same set of LLMs—suggesting that bias evaluation methods themselves are biased, and enterprises should not rely on a single bias benchmark for model selection.
Atomic Calibration of LLMs in Long-Form Generations: This work systematically studies atomic calibration in long-form generation, categorizing confidence elicitation methods into discriminative and generative approaches. It finds these two types to be complementary and proposes a fusion strategy based on confidence consistency, revealing interesting patterns in how model confidence changes during the generation process.

Browse all 89 LLM Evaluation papers →

⚡ LLM Efficiency (42)

A Drop-In Solution for On-the-Fly Adaptation of Speculative Decoding in Large Language Models: This paper proposes a drop-in adaptive solution for speculative decoding that dynamically adjusts the speculative window size $\gamma$ (and potentially the choice of draft models) during inference, thereby maximizing the end-to-end speedup of speculative decoding under diverse input distributions.
Accelerating Speculative Decoding via Efficient Context-Aware Draft Generation: This paper proposes an efficient context-aware draft generation strategy to accelerate speculative decoding. By enabling the draft model to dynamically adjust the generation quality based on the current context, it significantly improves LLM inference throughput while maintaining output consistency.
LaMPE: Length-aware Multi-grained Positional Encoding for Adaptive Long-context Scaling Without Training: This paper proposes LaMPE (Length-aware Multi-grained Positional Encoding), which adaptively determines the optimal mapping length using a parameterized scaled sigmoid function, and designs a three-region multi-grained attention mechanism (fine-grained local head + linearly normalized and compressed middle + tail that restores long-range dependencies) to achieve training-free plug-and-play context window extrapolation for LLMs, comprehensively outperforming existing methods on five major long-context benchmarks.
Boosting Long-Context Information Seeking via Query-Guided Activation Refilling: This paper proposes ACRE (Activation Refilling), which constructs a bi-layer KV Cache architecture—consisting of an L1 layer to compactly capture global information and an L2 layer to provide detailed local information. By using the input query to dynamically replenish relevant items from L2 to L1, ACRE achieves highly efficient processing of long-context information retrieval tasks, with significant improvements in both performance and efficiency.
CLaSp: In-Context Layer Skip for Self-Speculative Decoding: CLaSp proposes a training-free self-speculative decoding method that dynamically adjusts the layer skipping strategy based on context after each verification step using a dynamic programming algorithm. By utilizing the full hidden states of the previous verification step as the target to select the optimal set of skipped layers, it achieves a $1.3\times$ to $1.7\times$ speedup on the LLaMA3 series without altering the generation distribution.
CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels: The authors construct CNNSum—a multi-scale long-text summarization benchmark based on Chinese novels (695 samples, 16k-128k tokens)—ensuring quality via human annotation. Through a systematic evaluation of 20+ LLMs, they discover that advanced LLMs tend to generate subjective commentary leading to vague summaries, smaller models offer better cost-effectiveness, fine-tuning Base versions yields superior results over Chat versions, and fine-tuning on short-context data alone can significantly enhance long-text summarization capabilities.
Consistency-Preserving Contrastive Decoding for Faithful Document-Grounded Dialogue: This paper proposes Consistency-Preserving Contrastive Decoding (CPCD), a method that contrasts document-conditioned and document-free generation distributions during the decoding phase. This strategy enhances the faithfulness of document-grounded dialogue systems to source documents while maintaining response fluency and dialogue consistency.
Consultant Decoding: Yet Another Synergistic Mechanism: This paper proposes Consultant Decoding (CD), a novel cooperative decoding mechanism that verifies draft tokens based on the target model's negative log-likelihood (NLL). Compared to the likelihood-ratio verification methods of traditional speculative decoding, CD significantly improves the acceptance rate, reduces the frequency of target model calls, and maintains or even exceeds the generation quality of the target model.
Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis: Proposes a cross-layer knowledge attribution algorithm to systematically analyze the "basic-refinement" collaboration framework of shared experts and routed experts in MoE models, revealing that MoEs achieve 31% higher layer-wise efficiency compared to dense models, and validating the decisive impact of architectural depth on robustness through a semantic-driven routing mechanism (attention head-expert correlation $r=0.68$) and expert blocking experiments.
Giraffe: Design Choices for Extending the Context Length of Visual Language Models: This work systematically explores the design space for extending the context window of existing Visual Language Models (VLMs) to 128K. It proposes best practices across three dimensions—data recipe, positional encoding extension, and context utilization—and introduces two techniques: M-RoPE++ and hybrid-resolution training. The resulting Giraffe model achieves state-of-the-art (SOTA) performance among long-context VLMs.
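Several entries in this list (the drop-in adaptive solution, CLaSp, Consultant Decoding) revolve around speculative decoding and its draft window size $\gamma$. As a rough illustration of why the best $\gamma$ depends on the input, the sketch below uses the standard textbook analysis of speculative decoding, which assumes each draft token is accepted independently with probability `alpha`. It is not the adaptive method of any paper listed here; `alpha`, `cost_ratio` (draft cost relative to one target-model call), and `max_gamma` are illustrative assumptions.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model call with a draft window of gamma.

    Assumes i.i.d. acceptance with probability alpha (a simplifying assumption).
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.

    One step costs gamma draft calls (each cost_ratio of a target call)
    plus one target call for verification.
    """
    return expected_tokens(alpha, gamma) / (gamma * cost_ratio + 1.0)


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Window size in [0, max_gamma] that maximizes the expected speedup."""
    return max(range(max_gamma + 1),
               key=lambda g: expected_speedup(alpha, g, cost_ratio))


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        g = best_gamma(alpha, cost_ratio=0.05)
        print(alpha, g, round(expected_speedup(alpha, g, 0.05), 3))
```

Under this simplified model, a higher acceptance rate favors a larger window, which is the intuition behind adjusting $\gamma$ on the fly as the input distribution shifts.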

Browse all 42 LLM Efficiency papers →

📚 Pretraining (40)

Adversarial Tokenization: This paper observes that while the BPE tokenizer in the LLM pipeline produces only one canonical tokenization of a string, exponentially many valid tokenizations of that same string exist. By adversarially selecting non-canonical tokenizations, safety alignment can be bypassed without changing the original text, yielding an attack success rate comparable to existing SOTA text-level attack methods.
AsyncLM: Efficient and Adaptive Async Pre-training of Language Models: This paper proposes AsyncLM, an efficient asynchronous pre-training framework that addresses the gradient staleness issue in asynchronous distributed training through adaptive gradient compensation and dynamic batch scheduling strategies, improving the throughput of large-scale language model pre-training by 1.4-1.8x while maintaining model quality comparable to synchronous training.
AutoDS: Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts: AutoDS is proposed, which uses the base language model itself as a zero-shot generative classifier to automatically evaluate mathematical text quality by calculating a continuous LM-Score from YES/NO token logits. It filters high-quality corpora for continual pre-training, achieving an approximately 2x token efficiency improvement on MATH, GSM8K, and BBH.
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases: Proposes performing "pre-pretraining" on formal languages prior to natural language pre-training, demonstrating that formal languages with hierarchical dependency structures (such as k-Shuffle Dyck) provide effective inductive biases for Transformers, enabling a 1B-parameter model to achieve the same language modeling loss with 33% fewer tokens.
Byte Latent Transformer: Patches Scale Better Than Tokens: Proposes Byte Latent Transformer (BLT), a tokenizer-free byte-level LLM architecture that aggregates bytes into variable-length patches via entropy-based dynamic grouping. It matches the performance of token-based models at the 8B scale for the first time, while unlocking a new scaling dimension of improving inference efficiency by simultaneously scaling both patch and model sizes.
Chinese Grammatical Error Correction With Pre-trained Models and Linguistic Clues: This paper proposes a Chinese grammatical error correction method that integrates pre-trained language models with multi-level linguistic clues (pinyin, glyphs, and dependency syntax). By explicitly injecting linguistic prior knowledge, it enhances the correction model's ability to identify and amend Chinese-specific error types.
CritiQ: Mining Data Quality Criteria from Human Preferences: CritiQ proposes an automatic data quality criteria mining method based on agent collaboration. With only about 30 human preference annotation pairs, it can automatically discover interpretable data quality criteria and train a scorer for efficient data selection, significantly improving the downstream performance of Llama 3.1 in code, math, and logic domains.
Data-Constrained Synthesis of Training Data for De-Identification: This work systematically investigates how to generate synthetic clinical text using domain-adapted LLMs under data-constrained conditions and how to train NER models for Personally Identifiable Information (PII) detection via machine labeling. The study reveals that the quality of the machine labeler, rather than the scale of the generative model, is the key factor determining the utility of synthetic data.
Data Caricatures: On the Representation of African American Language in Pretraining Corpora: Combining quantitative experiments, human judgment, and qualitative analysis, this work systematically evaluates the quantity and quality of African American Language (AAL) across 12 open-source pretraining corpora. It finds that AAL constitutes only 0.007%–0.18% of the documents (far below its population representation). In C4, 28.9% of AAL texts are judged inappropriate for LLM generation, and 24.5% reinforce harmful stereotypes. Furthermore, 13 out of 16 automated filters systematically favor retaining White Mainstream English (WME) over AAL.
Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning: Data Whisperer proposes a training-free few-shot ICL data selection method using attention weighting. It leverages the pre-trained model's own ICL capabilities and attention scores to evaluate training samples, outperforming full-data fine-tuning with only 10% of the data while operating 7-20 times faster than existing methods.
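The AutoDS entry above describes computing a continuous LM-Score from YES/NO token logits. One natural reading is a two-way softmax over those two logits, sketched below. The function names, the example logits, and the `threshold` value are hypothetical; the paper's exact prompt and filtering rule are not given in this list.

```python
import math
from typing import Iterable, List, Tuple


def lm_score(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the YES/NO logits, giving a score in (0, 1)."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def filter_documents(
    scored: Iterable[Tuple[str, float, float]],
    threshold: float = 0.6,  # hypothetical cutoff, not taken from the paper
) -> List[str]:
    """Keep documents whose LM-Score is at least the threshold."""
    return [doc for doc, ly, ln in scored if lm_score(ly, ln) >= threshold]


if __name__ == "__main__":
    # Logits would come from a base LM prompted with a yes/no quality question.
    examples = [
        ("proof of the triangle inequality", 3.1, 0.4),
        ("navigation menu residue", -1.2, 2.5),
    ]
    print(filter_documents(examples))
```

Because the score is continuous, the cutoff can be tuned to trade corpus size against quality rather than relying on a hard YES/NO decision.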

Browse all 40 Pretraining papers →

✏️ Knowledge Editing (19)

A General Knowledge Injection Framework for ICD Coding: This paper proposes GKI-ICD, a general knowledge injection framework. By employing guideline synthesis and multi-task learning mechanisms, it simultaneously integrates three types of ICD knowledge—Description, Synonym, and Hierarchy—without requiring extra network modules, achieving SOTA performance on the MIMIC-III benchmark.
ToxEdit: Adaptive Detoxification Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing: ToxEdit is proposed—a toxicity-aware knowledge editing method that detects harmful hidden states in the early layers of LLM forward propagation using an SVM classifier. Through a routing mechanism, harmful inputs are directed to edited FFN replicas, while harmless inputs follow the original FFN. This achieves nearly 98% detoxification success rate and 95% instruction-following retention (DL metric) on LLaMA3-8B/LLaMA2-7B/Mistral-7B, resolving the key challenge of "detoxification vs. over-editing" in knowledge editing detoxification.
BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning: Proposes BMIKE-53, a cross-lingual benchmark covering 53 languages and integrating three knowledge editing datasets (zsRE, CounterFact, and WikiFactDiff). It systematically evaluates in-context knowledge editing methods from zero-shot to 8-shot settings, revealing that writing systems (Latin vs. non-Latin) are more decisive than language families for cross-lingual editing performance, and that metric-specific exemplar strategies significantly outperform hybrid configurations.
ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains: The ChainEdit framework is proposed, which aligns logical rules mined from knowledge graphs with the intrinsic logical reasoning capabilities of LLMs to achieve chain-based updates during knowledge editing, improving logical generalization accuracy from ~20% to 58-65%.
CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs: Constructs CKnowEdit, the first knowledge editing dataset oriented towards Chinese linguistic characteristics. It contains 1,854 samples in three major categories: linguistics (pinyin, ancient poetry, classical Chinese, idioms, proverbs), facts (history and geography), and logical traps (homophones, reasoning, wordplay). It systematically evaluates the performance of five mainstream knowledge editing methods on four Chinese LLMs, revealing unique editing challenges in Chinese.
CompKe: Complex Question Answering under Knowledge Editing: Proposes the CompKe benchmark—containing 11,924 complex questions—to evaluate the performance of knowledge editing methods in complex reasoning scenarios involving one-to-many relations, logical operations (intersection/union), and condition confirmation, revealing the significant deficiencies of existing methods in complex question answering.
Context-Robust Knowledge Editing for Language Models: This work identifies that existing knowledge editing methods significantly fail when prefix contexts are present (with editing success rates dropping from 90.9% to 69.1%). It introduces the CHED benchmark to evaluate context robustness and designs CoRE, a method that enhances the context robustness of editing through diversified prefix contexts and cross-prefix hidden state variance regularization, significantly narrowing the performance gap between settings with and without context while maintaining general model capabilities.
DocMEdit: Towards Document-Level Model Editing: This paper proposes the document-level model editing task for the first time and constructs the DocMEdit benchmark containing 37,990 data items and 105,652 editing facts, revealing the severe shortcomings of existing editing methods in long-context, multi-fact parallel editing scenarios.
Efficient Knowledge Editing via Minimal Precomputation: Demonstrates that the precomputation step (caching 44 million hidden vectors) for knowledge editing methods like MEMIT/ROME/EMMET can be reduced to 2-10 times the theoretical minimum (less than 0.3% of the original size), reducing precomputation time from dozens of hours to minutes with virtually no loss in editing performance.
Memorizing is Not Enough: Deep Knowledge Injection Through Reasoning: Proposes a four-level knowledge injection framework (Memorization → Retrieval → Reasoning → Association) and builds the DeepKnowledge synthetic evaluation platform. It systematically reveals the key factors for each level of knowledge injection: repetitive learning for memorization, diverse expressions for retrieval, and explicit reasoning patterns for deep reasoning and association, providing a complete method-level mapping for LLM knowledge updates.

Browse all 19 Knowledge Editing papers →

💬 LLM (Other) (442)

Towards Robust ESG Analysis Against Greenwashing Risks: A3CG: This work proposes the A3CG dataset and the Aspect-Action Analysis task (extracting aspects and their action types from sustainability claims: Implemented, Planning, or Indeterminate) to evaluate the robustness of NLP methods against greenwashing risks under cross-category generalization settings. It finds that supervised learning methods (GRACE F1 = 47.51) outperform LLMs (Claude 3.5 F1 = 42.03) but exhibit worse generalization efficiency.
A Large-Scale Real-World Evaluation of an LLM-Based Virtual Teaching Assistant: A RAG-based LLM Virtual Teaching Assistant (VTA) was deployed in a graduate-level AI programming course with 477 students at KAIST. Through longitudinal analysis of three rounds of surveys (472 respondents) and 3,869 interaction logs, the study revealed that the VTA significantly reduced students' psychological barriers to asking questions. While satisfaction among high-frequency users continuously improved over time, trust in the VTA remained lower than in human TAs.
A Modular Dataset to Demonstrate LLM Abstraction Capability: This paper proposes the ArrangementPuzzle dataset and trains LLM activation classifiers, finding that the classifiers identify reasoning correctness with >80% accuracy. This reveals that LLMs encode abstract reasoning concepts distinguishing logical equivalence from semantic equivalence in middle-to-late Transformer layers.
A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models: By analyzing the transition traces of latent representations during LLM inference to compute the semantic deviation of each layer, combined with a derived scaling law formula to estimate each layer's contribution to reducing loss, this paper determines "which layers to fine-tune," achieving an efficient fine-tuning approach that is orthogonal to PEFT.
SSUF: A Semi-supervised Scalable Unified Framework for E-commerce Query Classification: A unified framework for e-commerce query classification, SSUF, is proposed. It utilizes three pluggable modules—Label Enhancement (BERT semantic label encoding), Knowledge Enhancement (LLM world knowledge + posterior clicks + semi-supervised label generation), and Structure Enhancement (co-occurrence/semantic/hierarchical multi-graph fusion GCN)—to address insufficient information in short queries and the vicious cycle of the Matthew effect. SSUF achieves Macro F1 scores of 49.46 and 41.22 on JD.COM intent and category classification tasks, respectively (outperforming SOTAs like SMGCN), and has been deployed online, bringing significant commercial value.
A Survey of Automatic Prompt Optimization with Instruction-focused Heuristic-based Search Algorithm: This paper presents a systematic survey of over 80 automatic prompt optimization (APO) methods based on heuristic search algorithms, proposing a five-dimensional taxonomy (Where/What/What criteria/Which operators/Which algorithms) to unify fragmented research into a comprehensive analytical framework.
A Survey of LLM-based Agents in Medicine: How Far Are We from Baymax?: This paper systematically reviews the four-layer architecture (Profile, Clinical Planning, Medical Reasoning, and External Capacity Enhancement), four major application scenarios, and evaluation frameworks of LLM-based agents in medicine. Covering 60 studies from 2022 to 2024, it proposes four agent operational paradigms and identifies key challenges such as hallucination management, multimodal integration, and ethical concerns.
A Survey on Efficient Large Language Model Training: From Data-centric Perspectives: This paper presents the first systematic survey framework for "data-efficient LLM post-training", categorizing methods into five major areas: data selection, data quality enhancement, synthetic data generation, data distillation & compression, and self-evolving data ecosystems, thereby constructing a comprehensive "data value flywheel" system.
A Systematic Study of Compositional Syntactic Transformer Language Models: This paper proposes a unified framework to systematically study four key design dimensions of compositional syntactic Transformer language models (SLMs): tree format, linearization strategy, composition function, and sub-constituent masking. Covering existing models and 13 new variants, this work provides multiple design recommendations for SLMs through comprehensive evaluations across five dimensions: language modeling, syntactic generalization, summarization, dialogue, and inference efficiency.
A Training-free LLM-based Approach to General Chinese Character Error Correction: Proposes the Chinese Character Error Correction (C2EC) task, which covers substitution, missing, and redundant error types. By extending a training-free CSC method with Levenshtein distance and a prompt-based LLM, the proposed approach lets a 14B-parameter model perform on par with models up to 50 times larger, without direct fine-tuning.

Browse all 442 LLM (Other) papers →

📖 NLP Understanding (30)

A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment: This paper proposes the GraphMPA framework, which achieves global document understanding by constructing a hierarchical document graph based on general similarity metrics, and introduces mode-seeking preference optimization to replace traditional DPO for more precise human preference alignment, comprehensively outperforming existing RAG methods across six QA datasets.
A Variational Approach for Mitigating Entity Bias in Relation Extraction: Proposes an entity debiasing method based on Variational Information Bottleneck (VIB) that maps entity tokens to Gaussian distributions to selectively compress entity-specific information while preserving contextual semantics. This achieves SOTA performance across relation extraction datasets in generic, financial, and biomedical domains, particularly showing a notable improvement of 5.3 F1 points on BioRED in OOD scenarios.
Active LLMs for Multi-hop Question Answering: This paper proposes an active large language model framework that enables the LLM to actively decide when external information retrieval is required and when direct reasoning can be performed, thereby achieving a more efficient and accurate reasoning process in multi-hop question answering tasks.
Adapting Psycholinguistic Research for LLMs: Gender-Inclusive Language in a Coreference Context: By adapting the psycholinguistic experiment of Tibblin et al. (2023) from French to English and German LLMs, this work measures coreferent word probabilities and analyzes generated content. The findings show that English LLMs generally maintain antecedent-coreference gender consistency, but singular "they" is rarely used and an underlying masculine bias persists. The German Leo Mistral 7B model exhibits a stronger masculine bias that dominates all 8 gender-inclusive strategies; nevertheless, these inclusive strategies still increase the probability of feminine/neutral gender occurrences, aligning with the results of human psycholinguistic experiments.
Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification: Proposes a political bias analysis framework for LLMs based on Target-Oriented Sentiment Classification (TSC). By substituting the names of 1,319 politicians into 450 political sentences and predicting sentiments using 7 models across 6 languages, this study defines an entropy-based inconsistency metric to quantify bias. The findings reveal that LLMs exhibit a positive bias toward left-wing and centrist politicians and a negative bias toward the far-right, with larger models demonstrating stronger and more consistent biases.
Automatic Generation of Inference Making Questions for Reading Comprehension Assessments: A reading comprehension inference question taxonomy (pronominal bridging / text-connecting / gap-filling) is developed to automatically generate multiple-choice questions for specific inference types using GPT-4o few-shot prompting; while 93.8% of the questions are of acceptable quality, only 42.6% accurately match the target inference type, indicating LLMs still lack precise control over their reasoning abilities.
BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering: This paper proposes BELLE, a bi-level multi-agent debate framework. It first classifies multi-hop questions into four types, and then dynamically plans the optimal combination scheme of operators (such as CoT, single-step retrieval, and iterative retrieval) through a bi-level debate mechanism (a first-level affirmative-negative debate + a second-level fast/slow debater supervision), realizing adaptive multi-hop reasoning tailored to different question types.
BookCoref: Coreference Resolution at Book Scale: This work proposes BookCoref, the first book-scale coreference resolution benchmark. By employing an automatic annotation pipeline integrating character linking, LLM filtering, and window expansion, it generates high-quality silver annotation data across 50 full novels, with an average document length exceeding 200k tokens.
BQA: Body Language Question Answering Dataset for Video Large Language Models: Based on the BoLD dataset, BQA is constructed via a four-step semi-automatic pipeline. BQA is a body language emotion recognition multiple-choice QA benchmark containing 7,632 short videos. Evaluation reveals that the strongest VideoLLMs (GPT-4o/Gemini) achieve an accuracy of only about 60%, which is far below human performance (85%). Furthermore, it exposes the models' over-reliance on facial expressions and significant biases towards specific racial groups.
CaLMQA: Exploring Culturally Specific Long-Form Question Answering across 23 Languages: The first multilingual long-form question answering dataset, CaLMQA (51.7K questions, 23 languages), is constructed. Culturally specific questions are collected using a translation-free approach. The study reveals that the factuality of large language models (LLMs) on culturally specific questions (45-52%) is significantly lower than on culturally neutral questions (64-71%), with low-resource languages showing particularly poor performance.

Browse all 30 NLP Understanding papers →

✍️ Text Generation (27)

A Representation Level Analysis of NMT Model Robustness to Grammatical Errors: A systematic representation-level analysis of how NMT encoders process grammatical errors reveals that encoders first "detect" errors in shallow layers (indicated by rising GED probing $F1$), and then "correct" them in deep layers (indicated by falling CKA distance). It proposes the concept of "Robustness Heads" to identify the specific attention heads involved in error correction, validating this two-stage "detection-then-correction" mechanism across 4 models $\times$ 5 language directions.
Abstractive Snippet Generation: This paper proposes an abstractive snippet generation method for search engines. Using query-aware summarization, it generates more concise and informative snippets for search result pages than traditional extractive snippets, significantly improving the user search experience.
An Empirical Study of Many-to-Many Summarization with Large Language Models: This work presents the first systematic study of Large Language Model (LLM) performance on the Many-to-Many Summarization (M2MS) task. By integrating 8 datasets, the authors construct a benchmark containing 47.8K samples across 5 domains and 6 languages. Evaluating 18 LLMs reveals that zero-shot LLMs perform comparably to fine-tuned traditional models, and significantly outperform them after instruction tuning. However, factual consistency remains a critical bottleneck.
ATGen: A Framework for Active Text Generation: The authors propose ATGen, the first systematic active learning (AL) framework for NLG. It integrates state-of-the-art (SOTA) AL strategies, human/LLM annotation interfaces, parameter-efficient fine-tuning (PEFT), and vLLM inference optimization. Evaluation on four NLG tasks (including TriviaQA and GSM8K) demonstrates that active learning can reduce annotation costs by 2 to 4 times.
Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation: This paper proposes a systematic evaluation framework based on Context-Preserving Prefix Trees (CP-Trie) to evaluate the intrinsic adaptability of truncation sampling methods between diversity and risk using probability-free and tuning-free metrics, providing practical guidance for parameter selection in real-world applications.
CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation: Proposes CoCoLex, a training-free decoding strategy that constructs a copy distribution using the Euclidean distance between decoding hidden states and context token hidden states. By using a prediction entropy-based confidence score to dynamically balance the ratio of "copying from context" and "free generation", it consistently improves faithfulness and correctness across five legal benchmarks, showcasing particularly outstanding performance in long-text generation tasks.
Context-Aware Hierarchical Merging for Long Document Summarization: This work proposes Context-Aware Hierarchical Merging (CAHM), which effectively mitigates LLM hallucinations during ultra-long document (>100K tokens) summarization by incorporating relevant source document context (via extractive, retrieval, or citation methods) into the hierarchical merging process.
Decomposed Opinion Summarization with Verified Aspect-Aware Modules: This study decomposes the opinion summarization task into three progressively verifiable modules—Aspect Identification, Opinion Consolidation, and Meta-Review Synthesis. By using zero-shot prompting on LLMs, a domain-independent modular processing pipeline is achieved, generating more traceable and comprehensive summaries across three domains: peer reviews, business reviews, and product reviews.
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems: Through a literature review and crowdsourcing study, this work systematically compiles 21 categories of interventions to mitigate anthropomorphism in text generation system outputs. It proposes a four-dimensional conceptual framework encompassing intervention type, target behavior, operationalization, and negative impact, providing the most comprehensive infrastructure for deanthropomorphization research.
Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport: Proposes MBR-OT, which introduces Optimal Transport (Wasserstein distance) into Minimum Bayes Risk (MBR) decoding to evaluate document-level output quality using sentence-level utility functions. It significantly outperforms standard MBR decoding on document-level machine translation, text simplification, and dense image captioning tasks.
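For readers unfamiliar with the Minimum Bayes Risk (MBR) decoding that the MBR-OT entry builds on, the sketch below shows plain sample-based MBR selection: pick the candidate with the highest average utility against the other candidates. The unigram-F1 utility is a stand-in chosen only to keep the example self-contained; it is not the optimal-transport utility of MBR-OT, and both function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility: F1 over whitespace-token overlap (illustrative stand-in)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: List[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the candidate with the highest mean utility against the others."""
    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


if __name__ == "__main__":
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "dogs bark loudly",
    ]
    print(mbr_select(samples))
```

Swapping `utility` for a document-level function is where a method such as MBR-OT would differ from this baseline.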

Browse all 27 Text Generation papers →

🗣️ Dialogue Systems (18)

DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling: This paper proposes dialogue element modeling (DEMO), a novel task that systematically defines a comprehensive element taxonomy across the dialogue lifecycle from "prelude" to "epilogue." Based on this, the authors construct the DEMO benchmark covering both element awareness and dialogue agent interaction capabilities, and train DEMO agents using imitation learning, achieving superior performance on both in-domain and out-of-domain tasks.
Dialogue Systems for Emotional Support via Value Reinforcement: This paper proposes ES-VR, the first method that integrates human value reinforcement into emotional support dialogue systems. By leveraging a target value detector and a reference generator (both trained on Reddit data), combined with a two-stage SFT + DPO training scheme, the supporter model not only alleviates the seeker's negative emotions but also explores and reinforces their positive values, achieving a deeper, internal transformation.
Dynamic Label Name Refinement for Few-Shot Dialogue Intent Classification: Proposes a dynamic label name refinement method that utilizes LLMs to dynamically generate more distinctive intent label names (e.g., "Verify PAN" → "Verify PAN card details") based on retrieved examples in retrieval-based ICL intent classification. This effectively reduces confusion between semantically similar intents, consistently improving accuracy by 2.07%-7.51% across 6 datasets.
Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System: This paper proposes an immersive multimodal conversation system that endows chatbots with "eyes and ears." It constructs the M3C dataset, a multi-session multi-party dialogue dataset integrating vision and audio, and designs a dialogue model consisting of a dialogue module and a multimodal memory retrieval module, enabling dynamic, long-term conversations where multiple speakers share audiovisual experiences.
Enhancing Goal-oriented Proactive Dialogue Systems via Consistency Reflection and Correction: A model-agnostic two-stage CRC framework (Consistency Reflection & Correction) is proposed. By first prompting the model to reflect on inconsistencies between the generated response and the dialogue context and then correcting the response accordingly, it significantly improves the consistency of generated responses with the dialogue context in goal-oriented proactive dialogue systems.
EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance: EnSToM is proposed as a lightweight method based on entropy-scaled steering vectors, which dynamically adjusts steering intensity by leveraging the differences in internal layer entropy distributions of LLMs to enhance the topic maintenance capability of task-oriented dialogue systems without modifying model parameters.
Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles: This paper proposes the USP (User Simulator with Implicit Profiles) framework. By extracting implicit user profiles from human-machine dialogues and combining conditional supervised fine-tuning with cycle-consistency-based reinforcement learning, USP significantly outperforms baseline methods across three dimensions: authenticity, consistency, and diversity, improving semantic similarity and style similarity by approximately 34% and 43%, respectively.
Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling: This paper proposes an Accountability Model for task-oriented dialogue systems, which integrates an additional accountability head as a binary classifier into LLMs to predict the probability of each slot in dialogue states. This enables the detection and self-correction of false positive and false negative errors, improving JGA from 64.34 to 70.51 on MultiWOZ (+6.17 points, a 9.6% relative gain) and achieving SOTA.
KokoroChat: A Japanese Psychological Counseling Dialogue Dataset Collected via Role-Playing by Trained Counselors: This paper proposes KokoroChat, a Japanese psychological counseling dialogue dataset collected via role-playing by trained counselors, consisting of 6,589 long sessions and detailed client feedback ratings, designed to enhance the counseling response generation and dialogue evaluation capabilities of LLMs.
Exploring Persona Sentiment Sensitivity in Personalized Dialogue Generation: Large-scale analysis reveals that the quality of personalized dialogues generated by LLMs is highly sensitive to the sentiment polarity of user personas—negative personas lead to an overemphasis on persona traits that triggers contradictions, whereas positive personas generate higher-quality dialogues through selective persona integration. Based on these insights, mitigation strategies combining turn-by-turn generation, persona ranking, and sentiment-aware prompting are proposed.

Browse all 18 Dialogue Systems papers →

🌐 Multilingual & Translation (86)

A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs: This work systematically evaluates the zero-shot cross-lingual generalization capabilities of LLMs on three classical languages (Sanskrit, Ancient Greek, and Latin) across three NLU tasks: NER, machine translation, and question answering. It also contributes a dataset of 1,501 Sanskrit QA pairs and validates the effectiveness of RAG strategies, revealing that model scale is the decisive factor in cross-lingual generalization.
Accessible Machine Translation Evaluation For Low-Resource Languages: To address the evaluation dilemma of machine translation in low-resource languages, this paper proposes an accessible evaluation framework that does not rely on high-quality reference translations or large-scale annotated data, enabling effective translation quality assessment for resource-constrained languages.
Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation: The DCSQE framework is proposed to effectively alleviate the distribution shift in synthetic QE data through constrained beam search for generating more realistic synthetic translations, an independent annotator model to correct label bias, and the SPCE algorithm to aggregate token-level labels into phrase-level labels. It outperforms state-of-the-art baselines like CometKiwi in both supervised and unsupervised settings.
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT): This paper introduces HPLT v2, a large-scale multilingual dataset extracted from 4.5 PB of Internet Archive and Common Crawl data. It contains 8 trillion tokens of monolingual data covering 193 languages and 380 million parallel sentence pairs covering 51 languages, achieving significantly improved data quality through an enhanced data processing pipeline.
Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with UniMoral: This work proposes UniMoral, a unified moral reasoning dataset across 6 languages that models moral reasoning as a computational pipeline containing action prediction, moral typology classification, factor attribution, and consequence generation. Benchmarking on three LLMs reveals that implicit moral context enhances models' moral reasoning capabilities, yet specialized methods are still required.
AskQE: Question Answering as Automatic Evaluation for Machine Translation: This paper proposes AskQE, a question answering-based quality estimation framework for machine translation. By generating questions from the source text, answering them based on both the source text and the back-translation output, and comparing the answers to detect translation errors, it helps users who do not understand the target language determine the acceptability of translations. On the BioMQM dataset, its Kendall's $\tau$ correlation and decision accuracy outperform existing QE metrics.
7 Points to Tsinghua but 10 Points to 清华? Assessing Agentic Large Language Models in Multilingual National Bias: This paper presents the first systematic study of national bias in LLMs acting as multilingual recommendation agents in reasoning-based decision-making tasks. Utilizing three scenarios (university application, travel, and relocation) alongside the Thurstone Case III (comparative judgment) method, the study quantifies rating discrepancies for GPT-3.5, GPT-4, and Claude Sonnet across six languages. The findings reveal a widespread prevalence of "local language bias," and demonstrate that Chain-of-Thought (CoT) reasoning paradoxically exacerbates bias in non-English languages.
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization: This paper systematically evaluates the correlation of n-gram and neural network evaluation metrics with human judgments across 8 languages (representing 4 morphological typology families). The authors find that n-gram metrics negatively correlate with human judgments in highly fusional languages (Arabic, Hebrew), whereas COMET, a specially trained neural metric, consistently outperforms other methods across all language typologies.
Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning: Through a systematic analysis of multilingual ICL strategies, this study reveals that mixing demonstrations of various high-resource languages (HRLs) in the prompt consistently outperforms purely English demonstrations, yielding particularly significant improvements on low-resource languages (LRLs) (e.g., an 8.9% to 12.6% average LRL accuracy gain on Llama 3.1). Intriguingly, even merely appending context-irrelevant non-English sentences to the prompt yields measurable gains, revealing the phenomenon that "multilingual exposure itself is effective."
Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention: This paper proposes INCLINE (Inference-Time Cross-Lingual Intervention), a tuning-free inference-time framework. By learning an alignment matrix to transform internal representations of low-performance languages into the representation space of high-performance languages, it significantly boosts multilingual performance across 9 benchmarks and 5 LLMs.

Browse all 86 Multilingual & Translation papers →

🔍 Information Retrieval & RAG (84)

A Reality Check on Context Utilisation for Retrieval-Augmented Generation: This paper proposes the DRUID real-world fact verification dataset and the ACU evaluation metric, revealing that synthetic datasets (CounterFact, ConflictQA) exaggerate the impact of context features, leading to overly optimistic assessments of LLM context utilization capabilities, and calling for the study of RAG using real-world retrieved data.
A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens: This paper reveals an intriguing phenomenon in LLM text embeddings: when mapping embedding vectors back to the vocabulary space via the decoding layer, the tokens with the highest decoding probability align highly with the keywords of the input text. Furthermore, spectral analysis reveals that this phenomenon is primarily controlled by the first principal component. Based on this, a simple training-free sparse retrieval method is proposed, preserving over 80% of the original dense retrieval performance.
Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps: Proposes IDR², a model-agnostic adaptive RAG acceleration framework. By eliminating redundant representations of overlapping documents across multi-round retrieval and utilizing retrieved content to guide parallel decoding, it achieves approximately 2× end-to-end acceleration without compromising generation quality.
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark: This paper proposes AIR-Bench, the first heterogeneous IR benchmark that leverages LLMs to automatically generate test data. It covers 2 tasks (QA/Long-Doc), 9 domains, and 13 languages across 69 datasets. A three-stage quality control pipeline ensures that the generated data is highly consistent with human annotations, addressing the limitations of narrow domain coverage and high update costs in traditional IR benchmarks.
Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval: This paper formally defines the Visualized Information Retrieval (Vis-IR) paradigm, which uniformly renders multimodal information into screenshots for retrieval. It constructs the VIRA dataset containing 13 million screenshots, the UniSE retrieval model family, and the MVRB benchmark, laying the foundation for unified search engines.
ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search: Proposes the ARise framework, which integrates Bayesian risk assessment and dynamic RAG into Monte Carlo Tree Search to address the error propagation and verification bottleneck issues in knowledge-augmented reasoning. On multi-hop QA tasks, it outperforms state-of-the-art KAR methods by 23.10% and RAG-equipped reasoning models (DeepSeek-R1) by 25.37% in average accuracy.
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models: Astute RAG proposes a robust RAG approach against imperfect retrieval. By executing three steps—adaptive generation of internal LLM knowledge as a supplement, source-aware knowledge consolidation, and reliability-based answer generation—it significantly outperforms existing robust RAG methods on Gemini and Claude. Furthermore, it is the only method that does not perform worse than the No-RAG baseline in the worst-case scenario (where all retrieved documents are irrelevant).
Atomic LLM: A Fine-Grained Information Retrieval Evaluation Benchmark for Language Models: This paper proposes the Atomic LLM benchmark, which decomposes information retrieval evaluation into atomic-level fact retrieval tasks. It evaluates the information retrieval capabilities of LLMs across multiple dimensions, including factual precision, source attribution, and granularity coverage, revealing systematic deficiencies of existing LLMs in precise fact extraction.
Automatic Benchmark Generation from Scientific Papers via Retrieval-Augmented LLMs: This paper proposes an automated benchmark generation method based on retrieval-augmented LLMs. It automatically extracts testable knowledge points from scientific papers and generates high-quality evaluation questions. Its effectiveness has been validated across domains such as NLP, machine learning, and bioinformatics, providing a new paradigm for the rapid construction of domain-specific LLM evaluation benchmarks.
Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims: The ClaimSpect framework is proposed to automatically decompose complex claims into hierarchical aspect trees and discover supporting/neutral/opposing viewpoints along with their degree of consensus from a corpus through discriminative retrieval.

Browse all 84 Information Retrieval & RAG papers →

💻 Code Intelligence (28)

LongCodeU: Benchmarking Long-Context Language Models on Long Code Understanding: The authors propose the LongCodeU benchmark, which designs 8 tasks across four dimensions—code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long documentation understanding—to evaluate the comprehension capabilities of 9 long-context language models (LCLMs) on real-world, repository-level long code, revealing that 32K tokens is the practical upper limit for current LCLM long code understanding.
Beyond Sequences: Two-dimensional Representation and Dependency Encoding for Code Generation: This paper proposes a two-dimensional code representation that moves beyond traditional one-dimensional sequence representations. By explicitly encoding the structural dependency relationships of code (such as syntax tree structures and variable dependencies), it significantly improves the accuracy and structural correctness of code generation.
CoCo-Bench: A Comprehensive Code Benchmark for Multi-task Large Language Model Evaluation: This paper introduces CoCo-Bench (Comprehensive Code Benchmark), a comprehensive code benchmark covering four dimensions: code understanding, code generation, code modification, and code review. It supports multiple programming languages and difficulty levels, ensures data quality through rigorous manual review, and reveals the unbalanced performance of existing LLMs in coding capabilities.
CodeDPO: Aligning Code Models with Self Generated and Verified Source Code: Proposes CodeDPO, which constructs high-quality preference pairs (93K for correctness and 21K for efficiency) from self-generated code via a PageRank-inspired self-validation scoring mechanism. After DPO training, it achieves an average improvement of over 10 points on HumanEval across 8 code models, while speeding up code execution by 1.25-1.45×.
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation: CodeIF is proposed as the first systematic benchmark to evaluate the instruction-following capabilities of LLMs in code generation. It includes 50 fine-grained constraint instructions across 8 major categories, introduces 4 new evaluation metrics, and comprehensively evaluates 35 SOTA models.
CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models: The CodeReviewQA benchmark is proposed, decomposing the Automated Code Refinement (ACR) task into three intermediate reasoning steps: Change Type Recognition (CTR), Change Localization (CL), and Solution Identification (SI). Each step is formulated as a multiple-choice question-answering (MCQA) probe with different difficulty levels. Evaluated with 72 LLMs on 900 human-verified, high-quality samples (across 9 languages), it reveals specific weaknesses of models in code review comprehension.
CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System: Proposes CompileAgent, the first LLM agent framework designed for repository-level code compilation. By integrating five specialized tools and a flow-based agent strategy, it improves the compilation success rate by up to 71% on CompileAgentBench (consisting of 100 real-world C/C++ projects), costing only $0.22 per project on average.
CoRet: Improved Retriever for Code Editing: Proposes CoRet, a dense retrieval model tailored for code editing tasks. By integrating code semantics, repository-level file hierarchy, and call graph dependencies, and employing a log-likelihood loss function designed for repository-level retrieval, CoRet improves Recall by at least 15 percentage points over existing models on SWE-bench and Long Code Arena.
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal: This paper proposes DARS (Dynamic Action Re-Sampling), an inference-time compute scaling method for coding agents. It dynamically branches and attempts alternative actions at key decision points where the agent makes suboptimal choices. Using Claude 3.5 Sonnet V2, DARS achieves 55% pass@k and 47% pass@1 on SWE-Bench Lite, outperforming the open-source SOTA frameworks of the time.
DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation: This paper proposes DynaCode, a dynamic, complexity-aware code generation benchmark. By classifying code problems based on cyclomatic complexity and nesting them using Call Graphs, DynaCode dynamically generates approximately 189 million unique problems. This design effectively mitigates data contamination and systematically evaluates the code generation capabilities of LLMs across different complexity levels.
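
To make the cyclomatic complexity measure mentioned in the DynaCode entry concrete, here is a minimal sketch that approximates it for Python source using the standard `ast` module: one plus the number of decision points. The set of node types counted and the function name are assumptions chosen for this example; the note above does not say how DynaCode itself computes the metric.

```python
import ast

# Node types treated as one decision point each (an illustrative choice).
_BRANCH_NODES = (
    ast.If,
    ast.For,
    ast.AsyncFor,
    ast.While,
    ast.ExceptHandler,
    ast.IfExp,
    ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity as 1 + the number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit paths.
            complexity += len(node.values) - 1
    return complexity


example = """
def f(x):
    if x > 0 and x < 10:
        return 1
    for i in range(x):
        while i:
            i -= 1
    return 0
"""
print(cyclomatic_complexity(example))  # if + and + for + while + 1 = 5
```

A benchmark could bucket problems by thresholds on this score; any such thresholds would be a design choice and are not taken from the paper.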

Browse all 28 Code Intelligence papers →

🎨 Image Generation (9)

A Unified Agentic Framework for Evaluating Conditional Image Generation: CIGEval is proposed as a unified evaluation framework based on Large Multimodal Model (LMM) Agents. By integrating various tools (Grounding, Highlight, Difference, Scene Graph) and adopting a divide-and-conquer evaluation strategy, it achieves correlation comparable to human annotators (0.4625 vs. human-to-human 0.47) across 7 conditional image generation tasks, and surpasses the SOTA GPT-4o baseline by fine-tuning a 7B model on only 2.3K training samples.
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Models: Proposes D-GEN—the first open-source distractor generation model (fine-tuned LLaMA, 8B/70B) that automatically converts open-ended evaluation questions into multiple-choice formats, paired with two evaluation methods (ranking alignment and entropy analysis) to verify distractor quality, maintaining model ranking consistency with Spearman's ρ=0.99 on MMLU.
Planning with Diffusion Models for Target-Oriented Dialogue Systems: DiffTOD models dialogue planning as a trajectory generation problem, leveraging a masked diffusion language model to achieve non-sequential dialogue planning. It designs three guidance mechanisms (word-level/semantic-level/search-based) to flexibly control the dialogue toward the target, significantly outperforming baselines in negotiation, recommendation, and chitchat scenarios.
FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation: This paper introduces Rectified Flow to text-to-audio generation. By leveraging bifocal samplers to optimize timestep distribution, immiscible flow to minimize total data-noise distance, and anchored optimization to correct CFG guidance errors, the proposed method achieves single-step generation with a FAD of 1.49, outperforming 100-step diffusion models while reaching a generation speed of 400x real-time.
Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models: Math2Visual proposes a framework to automatically generate pedagogical visualizations from textual descriptions of math word problems (MWPs). It defines a visual language and design space based on teacher interviews, constructs a labeled dataset of 1,903 images, and evaluates and fine-tunes multiple TTI models, revealing key deficiencies of current models in representing mathematical relationships.
Multimodal Pragmatic Jailbreak on Text-to-image Models: This paper proposes a new type of attack called "Multimodal Pragmatic Jailbreak" (MPJ). The attack prompts T2I models to generate images containing visual text, where the image content and the text content are each safe when evaluated individually but unsafe when combined. This study reveals that all tested models, including DALL·E 3, are vulnerable to this attack.
OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching: This paper proposes OZSpeech, the first zero-shot TTS system that combines Optimal Transport Conditional Flow Matching (OT-CFM) with a learned prior distribution to achieve one-step sampling. It significantly outperforms existing approaches in content accuracy (WER), inference speed, and model size.
R-VC: Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching: R-VC is the first zero-shot voice conversion system to achieve rhythm control. It models the target speaker's rhythm style using a Mask Transformer duration model, combined with a Shortcut Flow Matching DiT decoder to achieve efficient and high-quality speech generation in only 2 sampling steps, achieving a WER of 3.51 and speaker similarity of 0.930 on LibriSpeech.
Synthia: Novel Concept Design with Affordance Composition: Synthia proposes a novel concept design framework based on affordance composition. By leveraging a hierarchical concept ontology, an affordance sampling strategy, and curriculum learning to fine-tune a T2I model, it generates innovative designs that are both visually novel and functionally coherent.

🎬 Video Generation (2)

Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval: Q2E proposes a zero-shot query-to-event decomposition method. It leverages the parameterized world knowledge of LLMs and VLMs to decompose simple queries into prequel, current, and sequel events. Combining these with dense video descriptions and speech transcriptions, it achieves SOTA multilingual text-to-video retrieval performance through inverse entropy fusion ranking.
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation: This work proposes VidCapBench, the first video captioning evaluation benchmark designed specifically for controllable text-to-video (T2V) generation. It evaluates caption quality across four dimensions: aesthetics, content, motion, and physical laws. Comprising 643 videos and 10,644 QA pairs, experiments demonstrate that VidCapBench scores are highly positively correlated with T2V generation quality.

🧩 Multimodal VLM (111)

A Parameter-Efficient and Fine-Grained Prompt Learning for Vision-Language Models: This paper proposes the DoPL (Detail-oriented Prompt Learning) method. Based on the theory of low-entropy information concentration, it discovers shared-interest tokens between text and vision. It uses these to construct alignment weights to enhance text and visual prompts. With only 0.25M (0.12%) trainable parameters, it achieves fine-grained multimodal semantic alignment, surpassing full-parameter fine-tuning methods on six benchmarks.
Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference: This paper discovers "visual regions" within LLMs—sparse and uniformly distributed subsets of layers similar to the human visual cortex. Updating only 25% of the layers preserves 99% of visual performance while maintaining or even improving language capabilities. Based on this, the authors propose an efficient paradigm for visual-region-targeted training and pruning.
Adaptive Linguistic Prompting (ALP) Enhances Phishing Webpage Detection in Multimodal Large Language Models: Proposes Adaptive Linguistic Prompting (ALP), an 8-shot structured prompting approach that guides multimodal LLMs to jointly reason across HTML text, screenshots, and URLs to detect phishing webpages. Combined analysis achieves an F1-score of $0.93$ on GPT-4o, outperforming traditional zero-shot baselines.
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates: This paper proposes the MAC benchmark and a diversity-promoting self-training method. By leveraging LLMs to generate deceptive texts, it systematically exposes the compositional vulnerabilities of pre-trained multimodal representations like CLIP, significantly outperforming existing methods across image, video, and audio modalities.
Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents: This paper proposes Agent-RewardBench, the first benchmark to evaluate the capability of multimodal LLMs as agent reward models. It covers three dimensions (perception, planning, and safety) across seven real-world scenarios, containing 1,136 high-quality step-level samples. Experiments reveal that even the strongest model, GPT-4o, achieves only 61.4% accuracy, and stronger models surprisingly perform worse in the safety dimension.
AGRI-CM3: A Chinese Massive Multi-Modal Multi-Level Benchmark for Agricultural Understanding: This paper introduces AGRI-CM3, a large-scale Chinese multimodal and multi-level evaluation benchmark for the agricultural domain. It covers various agricultural subtasks, including crop identification, pest and disease diagnosis, and farming operation understanding, to systematically evaluate the capabilities of VLMs in the agricultural vertical domain.
AkaCE: A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues: This work constructs AkaCE—the first multimodal conversational emotion recognition dataset for an African language, covering Akan (the primary language of Ghana, with approximately 20 million speakers). It contains 385 dialogues with 6,162 utterances (spanning audio, visual, and text modalities), 308 speakers (gender-balanced with 155 males and 153 females), and provides the first word-level prosodic prominence annotations for an African language.
Aligning VLM Assistants with Personalized Situated Cognition: Using the sociological concept of "Role-Set" to characterize user diversity, this paper proposes the PCogAlign framework. By utilizing a cognition-aware, action-oriented reward model to generate personalized responses for VLM assistants, the framework ensures that users with different roles receive advice tailored to their specific needs within the same visual scene.
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models: Proposes AlignMMBench, the first multimodal alignment evaluation benchmark for Chinese visual contexts, covering 13 tasks across 3 major categories, with 1,054 images and 4,978 QA pairs (including single-turn/multi-turn dialogues). Additionally, a ChatGLM3-6B-based evaluator, CritiqueVLM, is trained, which outperforms GPT-4 in evaluation consistency.
Aria-UI: Visual Grounding for GUI Instructions: This paper proposes Aria-UI, a vision-only multimodal model specifically designed for GUI visual grounding. By utilizing a scalable instruction synthesis data pipeline and an interleaved text-image action history mechanism, Aria-UI achieves state-of-the-art (SOTA) performance on both offline and online agent benchmarks, including 1st place on AndroidWorld (44.8%) and 3rd place on OSWorld (15.2%).

Browse all 111 Multimodal VLM papers →

🧠 VLM Reasoning (18)

AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness: This work proposes AdamMeme—an adaptive evaluation framework based on multi-agent collaboration, which probes the reasoning capabilities and specific weaknesses of Multimodal Large Language Models (mLLMs) in harmful content understanding by iteratively generating more challenging meme samples.
Answering Complex Geographic Questions by Adaptive Reasoning with Visual Context and External Commonsense Knowledge: This paper proposes an adaptive reasoning framework for complex geographic questions. It combines visual context (such as maps and satellite images) with external commonsense knowledge bases for multi-step reasoning, dynamically selecting reasoning paths based on question complexity, and significantly outperforms direct end-to-end answering methods on geographic VQA tasks.
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning: This paper constructs a systematic evaluation benchmark to assess large vision-language models (LVLMs) on basic visual graph structure understanding and reasoning, finding that existing models perform poorly on such tasks, and proposes targeted improvement methods.
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs: This paper proposes a method to transfer the reasoning capabilities of LLMs to VLMs. Through improved chart representation pre-training, construction of large-scale synthetic reasoning datasets, and multi-task fine-tuning, the 5B-parameter PaLI-3 model outperforms models 10 times its size on ChartQA.
FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning: This work constructs FCMR, a cross-modal multi-hop reasoning benchmark in the financial domain, comprising three modalities: text, tables, and charts. It is categorized into three difficulty levels: Easy, Medium, and Hard. The strongest model, Claude 3.5 Sonnet, achieves only 30.4% accuracy on the Hard level, revealing critical bottlenecks of MLLMs in the information retrieval phase.
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation: This work constructs FinMME, an evaluation benchmark containing over 11,000 high-quality financial multimodal samples across 18 financial domains and 10 chart types. It proposes the FinScore evaluation framework, which integrates hallucination penalties with domain normalization. Experimental results show that even GPT-4o scores only 15.34 (with an average accuracy of 46.56%), revealing significant deficiencies of MLLMs in the financial domain.
Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?: This work systematically evaluates 13 open-source small LVLMs ($\le 9\text{B}$ parameters) serving as judges for chart comprehension and reasoning tasks. It finds that some open-source models (e.g., LLaVA-Critic-7B) can achieve evaluation capabilities close to GPT-4 (about 80% agreement rate), though issues like positional bias and length bias remain prevalent.
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating: This paper proposes the LongDocURL benchmark, which covers 20 subtasks across three primary task categories: understanding, numerical reasoning, and cross-element locating. It contains 2,325 high-quality QA pairs spanning over 33,000 pages of documents. A systematic evaluation of 26 model configurations exposes key performance gaps of current LVLMs in long document understanding.
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale: Proposes a scalable, low-cost method to construct MAmmoTH-VL-Instruct, a multimodal instruction-tuning dataset of 12 million instances rich in Chain-of-Thought (CoT) reasoning, using only open-source models. The resulting model, MAmmoTH-VL-8B, achieves state-of-the-art (SOTA) performance on multimodal reasoning benchmarks (e.g., MathVerse +8.1%, MMMU-Pro +7%, MuirBench +13.3%).
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning: This paper proposes leveraging code as a supervisory signal for cross-modal alignment to construct the ImgCode-8.6M dataset consisting of 8.6 million image-code pairs, and the MM-MathInstruct-3M dataset containing 3 million multimodal mathematical instruction-tuning samples. The trained MathCoder-VL achieves state-of-the-art (SOTA) performance in multimodal mathematical reasoning among open-source models, outperforming GPT-4o and Claude 3.5 Sonnet on geometry problems.

Browse all 18 VLM Reasoning papers →

⚡ VLM Efficiency (8)

EffiVLM-Bench: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models: Proposes EffiVLM-Bench, a unified evaluation framework to systematically evaluate training-free acceleration methods (token compression + parameter compression) for LVLMs across four dimensions: performance, generalization, faithfulness, and efficiency. Spanning 3 cutting-edge models and 17 benchmark tasks, it reveals the Pareto-optimal trade-offs of various methods under different compression rates.
Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models: This paper proposes Hierarchical Safety Realignment (HSR), a method that first identifies safety-critical attention heads and then locates and restores safety-critical pruned neurons within these heads. With minimal parameter overhead (on the order of ten-thousandths), HSR significantly recovers the safety performance lost in pruned LVLMs.
HotelMatch-LLM: Joint Multi-Task Training of Small and Large Language Models for Efficient Multimodal Hotel Retrieval: This paper proposes HotelMatch-LLM, an asymmetric architecture that employs an SLM to encode queries and an LLM to encode hotel documents. Combined with a tri-objective multi-task optimization (retrieval alignment + MLM geographic prediction + visual facility recognition) and patch-level mean pooling for multi-image processing, it significantly outperforms SOTA methods like MARVEL and VISTA on travel-domain multimodal retrieval tasks.
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference: This paper proposes MadaKV, a modality-aware KV cache eviction strategy. Through two core components—Modality Preference Adaptation (MPA) and Hierarchical Compression Compensation (HCC)—MadaKV significantly reduces KV cache memory consumption (by 80-95%) and decoding latency (1.3x to 1.5x speedup) while maintaining performance on multimodal long-context tasks.
OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval: The authors propose OMGM, a multimodal RAG system for knowledge-based visual question answering (KB-VQA). By orchestrating the matching between query and knowledge base across various granularities and modalities via a coarse-to-fine three-step retrieval strategy, OMGM achieves state-of-the-art retrieval performance and highly competitive VQA results on InfoSeek and E-VQA datasets.
RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs: This work proposes the RedundancyLens framework to systematically reveal the extensive structured and clustered redundancy in self-attention and FFN operations for visual tokens within decoder-only MLLMs. Leveraging this finding, training-free inference acceleration is achieved, which is orthogonal to and combinable with existing token compression methods.
Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding: Proposes the Sophia model to handle hour-scale long videos: accurately selects query-relevant frames via Shot-adaptive Frame Pruning (a two-stage frame pruning based on shot segmentation), and replaces full attention with Hierarchical Attention of $O(N)$ complexity. It achieves state-of-the-art (SOTA) performance on 6 out of 8 long video benchmarks, while requiring only 1/8.5 of the attention FLOPs compared to InternVL2.
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?: Large-scale benchmark experiments reveal several fundamental issues with current visual token pruning methods for MLLMs: elaborately designed pruning strategies (such as FastV and SparseVLM) underperform even naive methods like random selection and pooling on most benchmarks. This is due to positional bias in attention scores, misuse of language information, imbalance between importance and redundancy, and unreliable evaluation metrics.
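The two naive baselines named in the token pruning entry above (random selection and pooling) can be sketched in a few lines. This is a minimal illustration only, not the paper's code: the function names, the list-of-lists token representation, and the toy feature values are all assumptions made for the example.

```python
import random


def random_select(tokens, keep, seed=0):
    """Keep `keep` tokens chosen uniformly at random, preserving original order."""
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept]


def average_pool(tokens, window):
    """Merge each run of `window` consecutive tokens into their element-wise mean."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


# Toy input (hypothetical): 8 visual tokens with 2-dimensional features.
visual_tokens = [[float(i), float(i) * 2.0] for i in range(8)]
subset = random_select(visual_tokens, keep=4)   # 8 tokens -> 4 tokens
pooled = average_pool(visual_tokens, window=2)  # 8 tokens -> 4 tokens
```

Both functions halve the token count here without looking at attention scores or the text prompt, which is what makes them a useful reference point for more elaborate pruning strategies.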

🎵 Audio & Speech (46)

Finding A Voice: Exploring the Potential of African American Dialect and Voice Generation for Chatbots: This work presents a systematic study on integrating African American English (AAE) into chatbots across text and speech modalities. It reveals that while text-based AAE hurts the user experience, speech-based chatbots paired with an African American accent are favored by AAE speakers, highlighting the crucial role of modality choice in linguistic personalization.
Acoustic Individual Identification of White-Faced Capuchin Monkeys Using Joint Multi-Species Embeddings: This paper explores utilizing cross-species acoustic pre-trained embeddings from birds and humans to identify individual calls of white-faced capuchin monkeys. It discovers that joint multi-species representations can further enhance identification performance, providing a new transfer learning paradigm for individual identification of wild animals under extreme data scarcity.
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment: This paper proposes INTP (Intelligibility Preference Speech Dataset) and multi-architecture DPO extension methods. Through preference alignment, the proposed approach significantly improves the intelligibility of zero-shot TTS systems in challenging scenarios (e.g., tongue twisters, repeated words, code-switching, and cross-lingual settings) while demonstrating weak-to-strong generalization.
AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration: This paper proposes AI4Reading, a Chinese audiobook interpretation system based on the collaboration of 11 specialized LLM Agents. It automatically generates interpretation scripts through phases of thematic analysis, case expansion, editorial refining, colloquial rewriting, and integration/revision, and then synthesizes audio using TTS. The generated interpretation scripts outperform the professional human interpretation platform, FanDeng Reading, in terms of quality (conciseness, completeness, accuracy, and coherence).
Amplifying Trans and Nonbinary Voices: A Community-Centred Harm Taxonomy for LLMs: This paper adopts a community-centred research methodology, building a specialized harm taxonomy for LLM outputs affecting Trans and Nonbinary (TNB) individuals through deep collaboration with the TNB community, thereby uncovering unique harm categories unaddressed by existing LLM safety evaluations.
ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors: This paper shows theoretically that the root cause of cross-lingual inconsistency in Multilingual Audio-Text Retrieval (ML-ATR) is the training data distribution error. It proposes two strategies, namely 1-to-K Contrastive Learning (KCL) and Audio-English Common-Anchor Contrastive Learning (CACL), to reduce this error, achieving SOTA performance in both recall and consistency.
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models: This paper proposes ADU-Bench, a comprehensive benchmark comprising 4 sub-datasets (general dialogue, professional skills, multilingualism, and ambiguity handling) totaling over 20,000 open-ended audio dialogues. It systematically evaluates 16 Large Audio-Language Models (LALMs) on their audio dialogue understanding capabilities, revealing significant deficiencies in current models regarding mathematical formula understanding, role-playing, multilingual processing, and speech ambiguity resolution.
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models: This paper uncovers and quantitatively analyzes the Discrete Representation Inconsistency (DRI) issue in neural audio codecs—where identical audio segments are encoded into different discrete token sequences depending on context. Two constraint methods, slice consistency and perturbation consistency, are proposed to improve average consistency by 21-36% and reduce the Word Error Rate (WER) by 3.72% in VALL-E speech generation.
Autoregressive Speech Synthesis without Vector Quantization: This paper proposes MELLE, an autoregressive language model for TTS based on continuous mel-spectrogram frames. By utilizing a regression loss, a variational inference sampling module, and a spectrogram flux loss, it directly predicts continuous spectrogram frames, thereby avoiding the fidelity loss and sampling robustness issues caused by vector quantization. This single-stage model achieves speech synthesis quality comparable to human speech.
Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis: This paper proposes Chain-Talker, which achieves interpretable empathetic conversational speech synthesis through a three-stage chain modeling (emotion understanding $\rightarrow$ semantic understanding $\rightarrow$ empathetic rendering), and develops CSS-EmCap, an automatic annotation pipeline to generate emotional captions for conversational speech.

Browse all 46 Audio & Speech papers →

🔎 AIGC Detection (15)

A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI: This paper proposes using LLM-generated NLI explanations to substitute expensive human explanations for approximating Human Judgment Distributions (HJD). Experiments demonstrate that with the guidance of human label distributions, LLM-generated explanations achieve comparable performance to human explanations across metrics like KL divergence and JSD. Furthermore, the approach generalizes well to datasets without human explanations (MNLI) and out-of-domain test sets (ANLI).
Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media: This work offers the first large-scale quantification of the changing proportion of AI-Generated Text (AIGT) on social media. By collecting 2.4 million posts across Medium, Quora, and Reddit, constructing the AIGTBench dataset, and training the best-performing detector, OSM-Det, the study reveals that the AIGT proportion on Medium and Quora rose from ~2% to ~37-39% between 2022 and 2024, whereas Reddit's proportion only increased from 1.3% to 2.5%.
An Empirical Study on Detecting AI-Generated Text in Financial Reports: Focusing on the highly regulated domain of financial reports, this paper systematically evaluates the performance of various AI-generated text detection methods (statistical features, neural network classifiers, watermark detection, etc.) in identifying AI-generated content in financial documents, revealing the significant impact of domain specificity on detection effectiveness.
People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text: In an experiment with 1,740 annotations, human annotators who frequently use LLMs for writing tasks detect AI-generated text with extremely high accuracy (a 5-person majority vote makes only 1 error in 300). Even when facing paraphrasing and humanization evasion strategies, they perform significantly better than most automated detectors.
ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data: This paper proposes ChemActor, a fully fine-tuned LLM chemical executor, which addresses the data scarcity issue in chemical synthesis action extraction through a sequential LLM-generated data framework and a distribution divergence-based data selection module, outperforming baseline models by 10% on R2D and D2A tasks.
Cognitive Framework for Detecting AI-Generated Fiction: This paper proposes an AI-generated novel/fiction detection framework based on cognitive linguistic features. By modeling cognitive patterns in human creative writing (such as narrative rhythm, emotional arc, and metaphor density), the framework distinguishes between human-written and AI-generated fictional texts, significantly outperforming existing detection methods in long-text scenarios.
Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training: This paper proposes the GREATER adversarial training framework, which simultaneously trains an adversarial attacker (Greater-A) and an MGT detector (Greater-D). The attacker identifies critical tokens through surrogate model gradients and perturbs them in the embedding space to generate adversarial samples. The detector learns generalized defense from these curriculum-style adversarial samples. Under 16 attacks, the attack success rate (ASR) drops to 5.53% (compared to 6.20% for SOTA), while the attack is 4 times as efficient as SOTA.
HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring: This study proposes HACo-Det, a fine-grained machine-generated text (MGT) detection benchmark tailored for human-AI collaborative writing. By employing a multi-round local rewriting pipeline, it automatically constructs 11,200 human-AI coauthored texts with word-level attribution labels. It adapts seven mainstream detectors into a word-level sequence labeling formulation for systematic evaluation, revealing significant room for improvement in current fine-grained detection methods.
KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis: This paper constructs the first Korean LLM-generated text detection benchmark, KatFish (covering three genres and four LLMs). By analyzing three types of Korean linguistic features—word spacing, POS diversity, and comma usage—the authors propose the KatFishNet detection method, achieving an average AUROC 19.78% higher than the best baseline under the OOD (unseen LLM) setting.
Learning to Rewrite: Generalized LLM-Generated Text Detection: The Learning2Rewrite (L2R) framework is proposed, which fine-tunes an LLM-based rewriting model to amplify the difference in rewrite edit distance between human-written and AI-generated text, thereby achieving highly generalized AI text detection across domains. L2R achieves an average AUROC of 0.9009 across 21 independent domains, outperforming RAIDAR by 4.67% and direct classification fine-tuning by 51.35% in out-of-distribution tests.
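The rewrite-edit-distance signal mentioned in the Learning2Rewrite entry above can be illustrated with a small sketch. It assumes the rewritten text has already been produced by some LLM (not shown), and it assumes that a smaller distance points toward machine-generated text, as in rewriting-based detection generally. The function names and the `threshold` value are hypothetical and are not taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rewrite_score(original, rewritten):
    """Word-level edit distance, normalized by the longer text's word count."""
    a, b = original.split(), rewritten.split()
    return edit_distance(a, b) / max(len(a), len(b), 1)


def looks_machine_generated(original, rewritten, threshold=0.2):
    # Assumption: the rewriter changes machine-generated text less than human text.
    # The threshold is a placeholder; a real system would tune or learn it.
    return rewrite_score(original, rewritten) < threshold
```

For example, `rewrite_score("a b c d", "a x c d")` is 0.25 because one of four words was substituted. Training the rewriter, which is the contribution described in the entry, is outside the scope of this sketch.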

Browse all 15 AIGC Detection papers →

🤖 Robotics & Embodied AI (7)

CHEER-Ekman: Fine-grained Embodied Emotion Classification: This paper proposes the CHEER-Ekman dataset, extending the binary embodied emotion annotations of the CHEER dataset into Ekman's six basic emotions. It employs an LLM-based automatic Best-Worst Scaling (BWS) technique to achieve fine-grained emotion classification without task-specific training, outperforming supervised BERT.
Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context: This work proposes the DICE dataset (2,066 sentences, 402 idioms), which uses highly controlled contrastive evaluation with identical idiom forms to reveal systematic flaws in LLMs when contextual understanding is required to disambiguate idioms (literal vs. figurative meanings).
Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks: This paper proposes a dynamic framework inspired by psychological manipulation checks, utilizing LLMs to modulate the emotional intensity of arguments and systematically investigate the causal impact of emotion on argument convincingness. The findings reveal that in more than half of the cases, human judgments of convincingness are unaffected by emotional changes; when emotion does have an effect, it is more likely to enhance rather than diminish convincingness.
DRAE: Dynamic Retrieval-Augmented Expert Networks for Lifelong Learning and Task Adaptation in Robotics: This paper proposes the DRAE framework, which integrates dynamic MoE routing, parametric RAG (P-RAG), a three-layer cognitive control architecture (ReflexNet-SchemaPlanner-HyperOptima), and DPMM lifelong knowledge retention. It achieves an average success rate of 82.5% on robotic manipulation and autonomous driving tasks, effectively mitigating catastrophic forgetting.
Task-aware MoILE: Hierarchical-Task-Aware Multi-modal Mixture of Incremental LoRA Experts for Embodied Continual Learning: This paper proposes a Hierarchical Embodied Continual Learning (HEC) setting, which divides agent learning into high-level instructions and low-level actions. It designs the Task-aware MoILE method—which automatically identifies tasks through cross-modal clustering, selects LoRA experts using dual routers, and retains past knowledge via SVD orthogonal training. It reduces the forgetting rate to 3.37% across 5 incremental learning scenarios (vs. 7.44% for the previous SOTA).
SELF-PERCEPT: Introspection Improves LLMs' Detection of Multi-Person Mental Manipulation in Conversations: This paper proposes the SELF-PERCEPT two-stage prompting framework, drawing on psychological Self-Perception Theory. It guides LLMs to first observe the behavioral cues of conversational participants before inferring their internal attitudes, significantly improving the detection of mental manipulation in multi-person, multi-turn dialogues.
Vulnerability of LLMs to Vertically Aligned Text Manipulations: This paper systematically reveals the severe vulnerability of LLMs to vertically aligned text inputs: vertically aligning only a small number of keywords can lead to a drop of 25-45 percentage points in text classification accuracy. While CoT reasoning fails to mitigate this issue, a well-designed few-shot learning paradigm can effectively recover performance.

🎮 Reinforcement Learning (8)

Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback: This paper proposes the Align-SLM framework, which applies preference optimization (DPO + RLAIF) to textless spoken language models (without text injection) for the first time. By utilizing LLMs to automatically evaluate the quality of generated speech continuations to construct preference datasets, combined with curriculum learning, the approach iteratively enhances the semantic understanding of SLMs, setting a new SOTA on benchmarks like ZeroSpeech and StoryCloze.
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient: This paper proposes a policy gradient-based structural pruning method for LLMs. By learning Bernoulli pruning masks in the probability space, it directly optimizes the loss function of the pruned model without requiring any backpropagation through the LLM itself, relying solely on forward inference to complete pruning optimization.
An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals: This paper is the first to apply Evolutionary Reinforcement Learning (ERL) to the task-oriented dialogue policy task. It proposes the EIERL method, which combines the global exploration of Evolutionary Algorithms (EA) with the local optimization of Deep Reinforcement Learning (DRL). It addresses the slow evolution of EA in the large search space of natural language through an Elite Individual Injection (EII) mechanism, achieving a more efficient exploration-exploitation balance across four datasets.
Learning to Generate Structured Output with Schema Reinforcement Learning: Proposes SchemaBench, a benchmark containing approximately 40,000 JSON schemas, and Schema Reinforcement Learning (SRL), a training framework. By utilizing a fine-grained schema validator to provide dense reward signals combined with a Thoughts of Structure (ToS) reasoning mechanism, SRL improves LLM accuracy in complex JSON generation by up to 16% without compromising general reasoning abilities.
LLM-Enhanced Self-Evolving Reinforcement Learning for Multi-Step E-Commerce Payment Fraud Risk Detection: Formulates e-commerce payment fraud detection as a multi-step MDP and utilizes LLMs (Mixtral/LLaMA/Gemma) to automatically generate and optimize RL reward functions through an evolutionary algorithm, significantly improving dollar-wise precision on real eBay transaction data compared to human-designed rewards and traditional SL baselines.
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning: Proposes MAPoRL—a post-training paradigm based on multi-agent reinforcement learning. By co-training multiple LLMs within a debate framework, integrated with verifier scoring and collaborative incentive mechanisms, it significantly enhances the effectiveness of multi-LLM collaboration and demonstrates cross-task generalization capabilities.
Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering: This paper proposes RL-Profiler, which trains a post relevance filter (SelNet) using reinforcement learning to select a small subset of posts relevant to personality traits from a user's large profile. These selected posts are then passed to an LLM for zero-shot personality prediction, thereby significantly reducing context length while maintaining prediction performance close to using all posts.
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search: TreeRL is proposed to directly integrate Entropy-Guided Parallel Tree search (EPTree) into on-policy reinforcement learning training for LLMs. By branching at tokens with high uncertainty, it expands the diversity of reasoning paths and utilizes global and local advantages derived from the tree structure as process supervision signals, surpassing traditional multi-chain sampling RL on mathematics and code reasoning tasks.
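The idea of "branching at tokens with high uncertainty" in the TreeRL entry above can be illustrated by ranking generation steps by the entropy of their next-token distribution. This is only a sketch of that one idea under simplifying assumptions: the function names and toy distributions are hypothetical, and the entry does not specify how EPTree actually selects or expands branch points.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branch_points(step_distributions, num_branches):
    """Indices of the `num_branches` highest-entropy steps, in sequence order."""
    ranked = sorted(range(len(step_distributions)),
                    key=lambda i: entropy(step_distributions[i]),
                    reverse=True)
    return sorted(ranked[:num_branches])


# Toy next-token distributions for four generation steps (hypothetical values).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain, highest entropy
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
selected = branch_points(steps, num_branches=2)  # steps 1 and 3
```

In a tree search, new continuations would be sampled from the selected positions rather than from the start of the sequence, so the sampling budget goes to the steps where the model is least certain.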

🎁 Recommender Systems (7)

Beyond Single Labels: Improving Conversational Recommendation through LLM-Powered Data Augmentation: To address the false negative problem in conversational recommender systems (where items users might like are incorrectly labeled as negative samples), an LLM-powered data augmentation framework is proposed. It generates synthetic labels through semantic retrieval and relevance scoring, and balances semantic relevance with collaborative information via a two-stage training strategy.
Laser: Bi-Tuning with Collaborative Information for Controllable LLM-Based Sequential Recommendation: This paper proposes the Laser framework, which inserts trainable virtual tokens as prefixes and suffixes to a frozen LLM (Bi-Tuning) to inject user-item collaborative information, and designs an MoE-based M-Former to capture diverse characteristics of different users, achieving parameter-efficient sequential recommendation.
CoVE: Compressed Vocabulary Expansion Makes Better LLM-based Recommender Systems: The CoVE framework is proposed to expand the LLM vocabulary by assigning a unique token ID and embedding to each item, which converts sequential recommendation into a next-token prediction task. Compared to existing methods, CoVE improves recommendation accuracy by up to 62% and achieves an approximately 100x speedup in inference, while addressing memory constraints in large-scale scenarios via hashed embedding compression.
GRAM: Generative Recommendation via Semantic-aware Multi-granular Late Fusion: This work proposes GRAM, a generative recommendation framework. By utilizing semantic-to-lexical translation, it encodes implicit hierarchical taxonomic and collaborative relationships of items into the LLM vocabulary space. Employing multi-granular late fusion, it independently encodes prompts of different granularities and fuses them at the decoder side, yielding Recall@5 improvements of 11.5–16.0% and NDCG@5 improvements of 5.3–13.6% across four benchmarks.
KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models: Proposes KERL, a unified food recommendation system that combines the FoodKG knowledge graph with Multi-LoRA fine-tuning of Phi-3-mini. It accomplishes three functions: personalized recipe recommendation (F1=0.973), recipe generation, and micronutrient estimation, significantly outperforming baseline LLMs and traditional embedding methods.
LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences: Proposes the LOTUS leaderboard, which uniformly evaluates the detailed image captioning capabilities of Large Vision-Language Models across three dimensions: description quality (alignment, descriptiveness, language complexity), side effects (hallucinations, toxicity), and societal bias (gender, skin tone), while supporting customized evaluation based on user preferences.
RecLM: Recommendation Instruction Tuning: This work proposes RecLM, a model-agnostic recommendation instruction tuning framework. It injects collaborative filtering signals into user/item profiles generated by an LLM via two-round conversational instruction tuning, and refines profile quality using RLHF (PPO). Serving as a plug-and-play component, it consistently improves performance for BiasMF, NCF, LightGCN, SGL, and SimGCL on MIND, Netflix, and industrial datasets, demonstrating significant efficacy particularly in cold-start scenarios.
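For readers unfamiliar with the Recall@5 and NDCG@5 figures quoted in the GRAM entry above, the sketch below computes both metrics in the single-target setting (one held-out ground-truth item per user), which is an assumption here rather than a detail taken from any of the listed papers. The function names are illustrative.

```python
import math


def recall_at_k(ranked_items, target, k):
    """1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """With a single relevant item, the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-indexed position of the target
    return 1.0 / math.log2(rank + 1)


def mean_metric(metric, predictions, targets, k):
    """Average a per-user metric over all users."""
    return sum(metric(r, t, k) for r, t in zip(predictions, targets)) / len(targets)
```

Recall@k only asks whether the target made the cut, whereas NDCG@k also rewards placing it nearer the top, which is why the two are usually reported together.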

🔄 Self-Supervised Learning (7)

AnalyticKWS: Towards Exemplar-Free Analytic Class Incremental Learning for Small-footprint Keyword Spotting: Proposes AnalyticKWS, an exemplar-free incremental learning method for keyword spotting. By freezing the feature extractor and analytically updating the classifier via recursive least squares, it outperforms all rehearsal-based methods on the GSC and SC-100 datasets with extremely low training time and memory overhead.
Improving Low-Resource Morphological Inflection via Self-Supervised Objectives: This paper systematically explores the effectiveness of 13 self-supervised auxiliary objectives (Autoencoding, CMLM, T5-style, etc.) in extremely low-resource morphological inflection tasks. It finds that autoencoding is optimal when unlabeled data is extremely scarce, whereas character-level MLM works better as more data becomes available. Mask sampling based on morpheme boundaries is the most promising direction.
Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing: This paper proposes an LLM Back Generation method that takes incomplete cross-domain constituency trees as input, prompting the LLM to complete the missing words to generate a treebank. It also designs a span-level contrastive learning pre-training strategy to achieve state-of-the-art performance in cross-domain constituency parsing.
Magnet: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities: This paper proposes Magnet, a method that turns decoder-only LLMs into both text encoders and infilling models using a hybrid attention mechanism (bidirectional + causal) and three self-supervised objectives (masked prediction + contrastive learning + missing span generation). It outperforms specialized methods like LLM2Vec on token-level and sentence-level representation learning tasks while avoiding the severe text repetition issue caused by bidirectionality.
QAEncoder: Towards Aligned Representation Learning in Question Answering Systems: Proposes QAEncoder, a training-free method that estimates the expected embedding of queries corresponding to a document as a proxy for the document representation, combined with a document fingerprint to maintain discriminability. This improves bge-large from 58.5 to 61.8 NDCG@10 on BEIR with zero additional storage or latency overhead.
SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction: Proposes SHuBERT (Sign Hidden-Unit BERT), adapting the masked cluster prediction paradigm of the speech self-supervised learning model HuBERT to sign language video. By clustering hand, face, and body pose streams separately and simultaneously predicting the cluster labels of masked frames, the model is pre-trained on approximately 984 hours of ASL video, achieving state-of-the-art (SOTA) on public benchmarks across translation, isolated recognition, and fingerspelling detection tasks.
WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning: Proposes WhiSPA, which aligns the latent space of the Whisper audio encoder with SBERT semantic representations and psychological dimensions (emotion, personality) through contrastive learning, eliminating the dependency on an additional text LM in speech processing and reducing error by 73-84% on psychological evaluation tasks.

🔗 Causal Inference (10)

Causal Graph based Event Reasoning using Semantic Relation Experts: A causal event graph generation framework involving multi-round collaborative discussion among four types of semantic relation experts (Temporal, Discourse, Conditional, Commonsense) is proposed. Under a zero-shot setting, it achieves competitive results compared to fine-tuned models on multiple downstream tasks, such as event prediction and event forecasting, while providing explainable causal event chains.
CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation: Proposes CausalRAG, which integrates causal graphs into the retrieval process of RAG. It builds a text graph from documents and identifies causal relationships. During querying, it retrieves context through causal path discovery and causal summary generation, significantly improving context precision (92.86%) and retrieval recall in document question answering.
CoA-Reasoning: Explorations on Counterfactual Analysis in Physical Reasoning of LVLMs: This paper proposes the CoA-Reasoning framework to systematically evaluate and enhance the causal understanding of Large Vision-Language Models (LVLMs) in physical world reasoning by constructing counterfactual scenarios, revealing significant limitations of existing models in counterfactual physical reasoning.
Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models: This paper proposes Counterfactual-Consistency Prompting, a method that addresses the inconsistency in temporal reasoning of large language models (LLMs) by generating counterfactual questions and imposing collective constraints, achieving significant improvements across multiple temporal understanding datasets.
Counterfactual Explanations for Aspect-Based Sentiment Analysis: This paper proposes a method for generating counterfactual explanations for aspect-based sentiment analysis (ABSA). By finding the minimal text edits that flip the sentiment polarity of a specific aspect, it provides intuitive causal explanations for ABSA model predictions.
FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation: This paper proposes the FitCF framework, which leverages BERT-based feature attribution methods (such as LIME/IG/SHAP) to extract important words to guide Large Language Models (LLMs) in generating counterfactual examples under a zero-shot setting (ZeroCF). After filtering through label-flip validation, these examples are utilized as few-shot demonstrations. This approach consistently outperforms three baseline methods (Polyjuice, BAE, FIZLE) on news classification and sentiment analysis tasks.
IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery: The IRIS framework is proposed. Requiring only a set of initial variable names as input, it automatically retrieves documents, extracts variable values to construct structured data, and builds causal graphs via hybrid causal discovery (GES statistical algorithm + LLM causal verification). Additionally, it iteratively expands the variable set using a missing variable proposal component, relaxing the acyclicity and causal sufficiency assumptions of traditional methods. IRIS comprehensively outperforms 0-shot, CoT, and RAG baselines in F1 score across 6 datasets, including Cancer, Diabetes, Obesity, ADNI, and Insurance.
Leveraging Variation Theory in Counterfactual Data Augmentation for Optimized Active Learning: This paper introduces Variation Theory into the Counterfactual Data Augmentation (CDA) framework, generating counterfactual samples using LLMs while preserving neuro-symbolic patterns, and incorporating a three-stage filtering pipeline to select high-quality data. This approach optimizes few-shot text classification in active learning, achieving significant F1 improvements across multiple datasets.
On the Reliability of Large Language Models for Causal Discovery: Using pre-training corpora accessible via open-source LLMs (OLMo, BLOOM), this study empirically validates the "Causal Parrot" hypothesis—that an LLM's ability to identify causal relationships is highly correlated with the frequency of that relationship in the pre-training data (Spearman $r=0.9$), and that the presence of erroneous causal relationships and changes in context significantly affect prediction reliability.
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation: This paper proposes COVER (COunterfactual VidEo Reasoning), a multi-dimensional video counterfactual reasoning benchmark. It classifies evaluation tasks into four quadrants comprising 13 categories across two dimensions (abstract-concrete and perception-cognition). By decomposing complex questions into sub-questions (necessary conditions), the benchmark reveals that sub-question accuracy is strongly correlated with counterfactual reasoning ability, and enhancing reasoning capacity is the key to improving robustness in video understanding.
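
The frequency-versus-reliability finding in "On the Reliability of Large Language Models for Causal Discovery" above is reported as a Spearman correlation. As a rough illustration of what that statistic measures, the sketch below computes it from scratch as the Pearson correlation of ranks. The `frequency` and `accuracy` values are invented toy data, not figures from the paper, and the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data (assumed): how often a causal relation occurs in a corpus,
# and how often a model identifies that relation correctly.
frequency = [3, 10, 50, 200, 1000]
accuracy = [0.20, 0.35, 0.30, 0.70, 0.90]
rho = spearman(frequency, accuracy)
```

Because only the ordering of the values matters, a single swapped pair (the second and third relations here) lowers the coefficient from 1.0 without any regard for how large the accuracy gap is.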

🔬 Interpretability (22)

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability: This work proposes a dual-perspective NLG meta-evaluation framework that decomposes traditional human-metric correlation into a global perspective (ordinal classification to judge coarse-grained quality levels) and a local perspective (adjacent pairwise comparison to distinguish fine-grained quality differences). By employing an automatic benchmark construction method, it avoids manual annotation and data contamination. Experiments on 16 LLM evaluators reveal that Qwen-2.5-72B performs best from the global perspective, while DeepSeek-V3 performs best from the local perspective.
An Empirical Study of Mechanistic Interpretability Approaches for Factual Recall: This paper systematically compares multiple mechanistic interpretability methods (such as causal tracing, activation patching, and probing analysis) in localizing and explaining the mechanisms of factual recall in LLMs, revealing the consistencies, discrepancies, and respective application scenarios of different approaches.
Around the World in 24 Hours: Probing LLM Knowledge of Time and Place: This paper presents the GeoTemp dataset (320k prompts covering 289 cities and 37 time zones) to evaluate the capability of LLMs in joint temporal and spatial reasoning for the first time. The study finds that models can handle time calculation and geographic knowledge independently, but their performance drops sharply when combining both is required.
Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages: Extends an information-theoretic bias attribution score metric to agglutinative languages (Filipino) by averaging subword scores to handle complex morphemic structures. Analysis on four multilingual PLMs reveals that bias in Filipino models is driven by entity-type topical words (people/objects/relationships), contrasting sharply with action-type topical words (crime/sexual activity) in English.
CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction: This paper proposes CLEME2.0, an interpretable reference-based GEC evaluation metric. By disentangling edits into four categories (correct correction TP, wrong correction FPne, under-correction FN, and over-correction FPun) and combining them with edit weighting techniques, it achieves state-of-the-art correlation with human judgments on both GJG15 and SEEDA datasets.
Cracking Factual Knowledge: A Comprehensive Analysis of Degenerate Knowledge Neurons in Large Language Models: This paper redefines degenerate knowledge neurons (DKNs) in LLMs from both structural and functional perspectives, proposes a neural topological clustering (NTC) method to identify DKNs of arbitrary sizes and structures, and reveals the intrinsic relationships of DKNs with LLM robustness, evolvability, and complexity through 34 experiments.
EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations: This paper proposes EXPERT, a reference-free image captioning evaluation metric based on VLM fine-tuning. By constructing a large-scale structured explanation dataset and designing a two-stage evaluation template, it achieves SOTA performance on multiple benchmark datasets while providing high-quality structured explanations across three dimensions: fluency, relevance, and descriptiveness.
IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory: IRT-Router borrows Item Response Theory (IRT) from psychometrics, treating LLMs as "test-takers" and queries as "exam questions." It learns multi-dimensional ability vectors along with difficulty and discrimination parameters to route queries across multiple LLMs interpretably, reaching over 87% accuracy in OOD scenarios at only 1/30 of the cost of GPT-4o.
Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs: This paper discovers and defines the phenomenon of "contextual entrainment" — where LLMs assign higher probabilities to any tokens that have appeared in the context. Using a differentiable masking method, the study localizes the entrainment heads responsible for this phenomenon and demonstrates that turning off these heads significantly suppresses distraction effects.
Mechanistic Interpretability of Emotion Inference in Large Language Models: By utilizing three mechanistic interpretability techniques—probing, activation patching, and generation steering—this study reveals that the emotional representations of LLMs are functionally localized in the MHSA units of intermediate layers. Furthermore, based on cognitive appraisal theory, it demonstrates that these representations are psychologically plausible, successfully steering emotional output through interventions on appraisal concepts (such as self-agency and pleasantness).

Browse all 22 Interpretability papers →
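
The IRT-Router entry above rests on Item Response Theory. As a minimal sketch of the idea, the code below uses the textbook two-parameter logistic (2PL) model with a scalar ability per model, whereas the paper is described as learning multi-dimensional ability vectors. The routing rule, the `min_probability` threshold, and all model names and numbers are hypothetical.

```python
import math


def success_probability(ability, difficulty, discrimination):
    """2PL IRT: probability that a test-taker answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def route(query, models, min_probability=0.8):
    """Pick the cheapest model predicted to clear min_probability.

    Falls back to the model with the highest predicted probability
    when no model clears the threshold. (Assumed rule, for illustration.)
    """
    scored = [
        (
            m["name"],
            m["cost"],
            success_probability(
                m["ability"], query["difficulty"], query["discrimination"]
            ),
        )
        for m in models
    ]
    good_enough = [s for s in scored if s[2] >= min_probability]
    if good_enough:
        return min(good_enough, key=lambda s: s[1])[0]
    return max(scored, key=lambda s: s[2])[0]


# Hypothetical model pool and queries.
models = [
    {"name": "small-model", "ability": 0.5, "cost": 1.0},
    {"name": "large-model", "ability": 2.5, "cost": 30.0},
]
easy_query = {"difficulty": -1.0, "discrimination": 1.5}
hard_query = {"difficulty": 1.5, "discrimination": 1.5}

easy_choice = route(easy_query, models)
hard_choice = route(hard_query, models)
```

With these toy parameters the easy query goes to the cheap model and the hard query to the expensive one. The parameters stay inspectable, which is the sense in which such a router is interpretable.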

📦 Model Compression (78)

500xCompressor: Generalized Prompt Compression for Large Language Models: The paper proposes 500xCompressor, which compresses up to around 500 natural language tokens into the KV values of as few as 1 special token, achieving compression ratios from 6x to 480x. It introduces only about 0.25% of additional parameters, while the LLM retains 62.26%–72.89% of its original capabilities after compression, significantly outperforming the ICAE baseline.
Accurate KV Cache Quantization with Outlier Tokens Tracing: It is discovered that a small number of outlier tokens in the outlier channels of KV Cache deviate from the previously assumed uniform distribution. To address this, the Outlier Tokens Tracing (OTT) method is proposed to dynamically trace and exclude these tokens during the quantization process. Under 2-bit quantization, this approach achieves a 6.4x memory compression and a 2.3x throughput speedup while significantly improving accuracy.
AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation: AlignDistil theoretically proves the equivalence between the RLHF objective and a token-level distillation process. Based on this, it designs a simple distillation method: constructing a teacher distribution through a linear combination of logit distributions from a DPO model and a reverse DPO model, and combining this with a token-adaptive extrapolation mechanism to achieve token-level reward optimization. It outperforms existing methods on AlpacaEval 2.0, MT-Bench, and Arena-Hard while achieving faster convergence.
APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs: APB proposes a distributed long-context inference framework. By introducing local KV cache compression and a mechanism to pass compressed context blocks across GPUs into the sequence parallelism framework, it achieves up to 9.2x, 4.2x, and 1.6x prefill speedup compared to FlashAttn, RingAttn, and StarAttn, respectively, without compromising task performance.
Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition: Proposes ODLRI (Outlier-Driven Low-Rank Initialization) to assign an explicit role to the low-rank component in the joint quantization and low-rank optimization (Q+LR) framework—capturing activation outlier-sensitive weights, allowing the quantized component to handle a smoother residual. This consistently reduces perplexity and improves zero-shot accuracy in 2-bit extreme quantization scenarios for Llama2/3 and Mistral.
Basic Reading Distillation: This paper proposes Basic Reading Distillation (BRD). By having a teacher LLM generate basic reading behavior data (including NER and QA) on general corpora, a small student model is trained to mimic these behaviors. This allows a 564M-parameter small model to reach or exceed the performance of a teacher model 20 times its size across various NLP tasks, without being exposed to downstream task data.
BeamLoRA: Beam-Constraint Low-Rank Adaptation: BeamLoRA observes that the importance of different ranks in LoRA modules varies significantly and evolves dynamically during training. Inspired by beam search, it proposes to dynamically evaluate rank importance during training, prune unimportant ranks, and expand the parameter space for important ranks. This improves performance under a fixed total rank, consistently outperforming LoRA and its variants across 12 datasets on three base models.
Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation: This paper proposes a knowledge distillation method that goes beyond logit matching. By aligning the dynamics of feature changes (rather than static feature snapshots) of the teacher and student models during the training process, it achieves more effective knowledge transfer, significantly improving distillation performance on NLP tasks.
Beyond Text Compression: Evaluating Tokenizers Across Scales: This paper systematically evaluates the impact of 6 tokenizers on 350M and 2.7B parameter models. It finds that tokenizer selection has very little impact on English tasks but has a significant and scale-consistent impact on multilingual tasks (such as machine translation). The paper also proposes a novel family of intrinsic evaluation metrics based on Zipf's law, which predict downstream performance in multilingual scenarios significantly better than text compression rates.
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization: This is the most comprehensive empirical study of LLM quantization to date, conducting over 500k evaluations of FP8/INT8/INT4 on the entire Llama-3.1 family (8B/70B/405B). It finds that FP8 is nearly lossless, INT8 incurs only a 1-3% drop, and INT4 is surprisingly competitive, while providing recommendations for selecting quantization formats in different deployment scenarios.

Browse all 78 Model Compression papers →
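
Several entries above concern low-bit quantization, and the Outlier Tokens Tracing entry in particular argues for excluding outliers from the quantized range. The sketch below illustrates why that helps, using plain min-max 2-bit quantization on a toy list of values. It is not the OTT algorithm: the outlier index is supplied by hand rather than traced dynamically, and all names and numbers are assumptions.

```python
def quantize_2bit(values):
    """Asymmetric uniform quantization of floats to 2-bit codes (0..3)."""
    lo, hi = min(values), max(values)
    # 3 = 2**2 - 1 quantization steps; fall back to 1.0 for a constant input.
    scale = (hi - lo) / 3 or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map 2-bit codes back to approximate float values."""
    return [lo + c * scale for c in codes]


def max_error(values, keep_full_precision=()):
    """Max absolute reconstruction error over the quantized entries.

    Indices in keep_full_precision bypass quantization entirely,
    so they contribute no error and do not stretch the range.
    """
    kept = set(keep_full_precision)
    inliers = [v for i, v in enumerate(values) if i not in kept]
    codes, lo, scale = quantize_2bit(inliers)
    restored = dequantize(codes, lo, scale)
    return max(abs(a - b) for a, b in zip(inliers, restored))


# Toy channel (assumed): index 4 holds an outlier value.
channel = [0.1, -0.2, 0.3, 0.0, 9.0, -0.1]
err_all = max_error(channel)
err_excluding_outlier = max_error(channel, keep_full_precision=[4])
```

When the outlier is quantized together with the rest, it stretches the range so far that the small values collapse onto one code. Storing it separately lets the four available levels cover the remaining values much more tightly.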

🕸️ Graph Learning (24)

A Generative Adaptive Replay Continual Learning Model for Temporal Knowledge Graph Reasoning: This paper proposes the Deep Generative Adaptive Replay (DGAR) method, which utilizes a pre-trained diffusion model to generate historical entity distribution representations, mitigates distribution conflicts by enhancing shared features between the historical and current distributions, and designs a layer-wise adaptive replay mechanism to integrate historical and current knowledge, significantly alleviating the catastrophic forgetting problem in continual learning scenarios for temporal knowledge graph reasoning.
A Mutual Information Perspective on Knowledge Graph Embedding: This paper proposes a Knowledge Graph Embedding (KGE) framework based on mutual information maximization. It enhances the semantic representation capability of entities and relations by maximizing the mutual information between different components of triples, achieving consistent performance improvements under complex relational patterns (e.g., 1-N, N-1).
Agent Steerable Search for Knowledge Graph Question Answering: This paper proposes an agent-based steerable knowledge graph search framework, enabling LLM agents to dynamically adjust graph search strategies (such as search depth, direction, and pruning rules) based on question types and reasoning requirements, achieving fine-grained control over the knowledge graph question answering process.
Beyond Completion: A Foundation Model for General Knowledge Graph Reasoning: This paper proposes MERRY, a foundation model for knowledge graphs (KGs) that handles both in-KG (zero-shot KGC) and out-of-KG (KGQA) reasoning tasks in a unified manner. By fusing textual and structural information via multi-view conditional message passing (CMP), MERRY outperforms existing methods across 28 datasets.
Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?: This paper proposes Morpher, a multimodal prompt learning paradigm. Under extremely weak text supervision (only a few tokens of label names), Morpher aligns a pre-trained GNN into the semantic space of an LLM by simultaneously learning graph prompts and text prompts, enabling cross-task and cross-domain graph classification transfer, as well as the first CLIP-style zero-shot GNN classification prototype.
Croppable Knowledge Graph Embedding: Proposes the MED framework to train "croppable" knowledge graph embeddings—optimizing 64 sub-models of different dimensions (sharing embedding prefixes) simultaneously in a single training run. Through mutual learning, evolutionary improvement, and dynamic loss weights, sub-models of each dimension can be directly cropped and used, outperforming independent training and distillation methods while being 10 times faster to train.
Cross-Document Contextual Coreference Resolution in Knowledge Graphs: Proposes a knowledge graph-based cross-document coreference resolution method. By associating textual entity mentions with knowledge graph nodes through a dynamic linking mechanism, it combines contextual embeddings and graph message passing reasoning to improve the precision and recall of cross-document entity recognition, outperforming traditional methods on multiple benchmark datasets.
Disentangled Multi-span Evolutionary Network against Temporal Knowledge Graph Reasoning: The authors propose DiMNet, which separates the active/stable features of node semantics through a multi-span evolution strategy and a cross-time disentanglement mechanism. This significantly improves extrapolation reasoning performance on Temporal Knowledge Graphs (TKGs), achieving SOTA results across four benchmark datasets.
Extending Complex Logical Queries on Uncertain Knowledge Graphs: This paper proposes a formal framework of "soft queries" to extend complex logical queries to uncertain knowledge graphs containing confidence values, and designs the SRC method combining forward reasoning and backward calibration to answer soft queries efficiently, with theoretical proofs that errors do not catastrophically cascade.
Fast-and-Frugal Text-Graph Transformers are Effective Link Predictors: This paper proposes the Fast-and-Frugal Text-Graph (FnF-TG) Transformer, which uniformly encodes textual descriptions and graph structure (ego-graph) via the Transformer's self-attention mechanism. Using only a small BERT, it outperforms SOTA models that rely on large BERT and MPNNs on inductive link prediction tasks, while extending to a fully inductive setting (where relations can also be inductive) for the first time.

Browse all 24 Graph Learning papers →

📈 Time Series (7)

ANRE: Analogical Replay for Temporal Knowledge Graph Forecasting: This paper proposes the ANRE (Analogical Replay) method, which retrieves structurally similar "analogical events" from historical knowledge graph snapshots as reasoning clues to assist future event forecasting in temporal knowledge graphs, achieving significant performance improvements across multiple benchmark datasets.
Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents: Proposes a multi-perspective role-playing framework (MPR) based on LLMs. By using subjective agents to simulate user posting and an objective agent (a fine-tuned "psychologist" LLM) to audit behavioral consistency, it forecasts social media users' future emotional responses to real-time events through iterative rectification, significantly outperforming traditional methods at both macro and micro levels.
CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis: A CTPD framework is proposed, which utilizes Slot Attention to discover cross-modality shared temporal prototype patterns from multimodal EHR data (irregular time series and clinical notes). Temporal semantics of both modalities are aligned via a TP-NCE contrastive loss, achieving SOTA performance on mortality prediction and phenotype classification tasks on MIMIC-III.
G2S: A General-to-Specific Learning Framework for Temporal Knowledge Graph Forecasting with Large Language Models: This paper proposes the G2S framework, which decouples general patterns (temporal structural regularities) from scenario-specific information (concrete entities/relations) in temporal knowledge graph (TKG) forecasting. By first learning general patterns on anonymized temporal structures and then injecting scenario-specific information, G2S effectively enhances the generalization capability of LLMs in TKG forecasting.
LETS-C: Leveraging Text Embedding for Time Series Classification: LETS-C is proposed: it digitizes time series into text strings, encodes them using a text embedding model, merges them with the original time series via element-wise addition, and feeds them into a lightweight CNN-MLP classification head. With only 14.5% of the trainable parameters, it achieves SOTA, outperforming 27 baselines including OneFitsAll (fine-tuned GPT-2) on 10 UEA multivariate time series datasets.
Revisiting LLMs as Zero-Shot Time-Series Forecasters: Small Noise Can Break Large Models: This paper systematically evaluates the effectiveness of LLMs as zero-shot time-series forecasters and discovers that LLMs are extremely sensitive to input noise—even a small amount of noise can lead to a drastic performance degradation, making them underperform even simple domain-specific models (such as DLinear). The authors suggest that future research should focus on fine-tuning LLMs to better handle numerical sequences.
Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement: Proposes the Time-MQA framework and the TSQA dataset (~200k QA pairs), unifying time series forecasting, imputation, anomaly detection, classification, and open-ended reasoning QA under a natural language question answering paradigm, and endowing LLMs with time series understanding and reasoning capabilities through continual pre-training.

🩺 Medical LLM (31)

A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment: This paper proposes a modular framework to efficiently adapt Small Language Models (SLMs) into clinical domain models. It includes pre-instruction tuning for domain experts (training multiple expert models on medical corpora), model merging (combining multiple experts into a unified MediPhi), and clinical-task alignment based on 2.5 million synthetic instructions (MediFlow). Ultimately, the 3.8B-parameter MediPhi outperforms GPT-4 on several clinical tasks.
A Retrieval-Based Approach to Medical Procedure Matching in Romanian: By modeling Romanian medical procedure name matching as a retrieval problem rather than a classification problem, under an extreme long-tail scenario of 39,097 standard entries (50% with only a single sample), this work compares BM25 sparse retrieval with three dense embeddings (mE5/RoBERT/BioClinicalBERT). After fine-tuning via metric learning, mE5 achieves 85.2% Acc@1. In real-world deployment, verification by doctors yields 94.7% accuracy, performing 1200 times faster than manual matching.
A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions: The first survey to systematically organize and review LLM research in psychotherapy using the APA three-stage (Assessment $\to$ Diagnosis $\to$ Treatment) conceptual taxonomy. Covering over 60 works, it comprehensively analyzes four levels from symptom detection to virtual therapists, revealing a four-fold imbalance across disorder coverage, language bias, methodology fragmentation, and theoretical integration.
Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees' Dialogue to Facilitate Nurse Communication Training: Proposes the Adaptive-VP framework, which utilizes LLMs to build Virtual Patients (VPs) that dynamically adjust their behavior based on the communication quality of nursing trainees. Through a four-module pipeline of multi-agent evaluation $\rightarrow$ dynamic adaptation $\rightarrow$ dialogue generation $\rightarrow$ safety monitoring, the framework significantly improves the perceived realism of VP interactions (persona fidelity $\eta_p^2 = 0.151$, dialogue realism $\eta_p^2 = 0.254$) in a between-subjects experiment with 28 nursing experts.
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset: This work constructs AfriMed-QA (15,275 questions across 32 specialties in 16 countries), the first large-scale pan-African medical QA benchmark, systematically evaluates 30 LLMs, and reveals significant regional performance gaps and the counter-intuitive phenomenon where domain-specific biomedical models underperform general-purpose models in African healthcare contexts.
Are LLMs Effective Psychological Assessors? Leveraging Adaptive RAG for Interpretable Mental Health Screening through Psychometric Practice: This paper proposes a questionnaire-guided mental health screening framework. By leveraging adaptive RAG to retrieve relevant content from users' Reddit posts, LLMs are used to fill out standardized psychometric scales (such as BDI-II) on behalf of users. It matches or outperforms supervised methods without requiring training data, while providing clinically interpretable assessment results.
ArgHiTZ at ArchEHR-QA 2025: A Two-Step Divide and Conquer Approach to Patient Question Answering for Top Factuality: A two-step "divide and conquer" approach was proposed for the ArchEHR-QA 2025 shared task: first, key sentences are extracted from electronic health records using a re-ranking model, and then a small medical LLM generates the response. This approach achieved first place in factuality and 8th/30 in overall score without using any external knowledge.
Automated Structured Radiology Report Generation: This work proposes a new task, Structured Radiology Report Generation (SRRG), which leverages LLMs to restructure free-text reports into standardized formats. It also introduces SRR-BERT, a 55-label disease classification model, and F1-SRR-BERT, an evaluation metric, addressing the challenges of report generation and evaluation caused by highly diverse reporting styles.
The Impact of Auxiliary Patient Data on Automated Chest X-Ray Report Generation and How to Incorporate It: This paper investigates how to integrate Emergency Department (ED) patient data (vital signs, medications, triage information, etc.) into multimodal language models for automated chest X-ray report generation. It proposes a method to convert heterogeneous tabular data, text, and images into unified embeddings, which significantly improves the clinical accuracy of reports on the MIMIC-CXR + MIMIC-IV-ED datasets, outperforming multiple baseline models including CXRMate-RRG24.
Improving Automatic Evaluation of LLMs in Biomedical Relation Extraction via LLMs-as-the-Judge: This paper presents the first systematic study of LLM-as-the-Judge in evaluating biomedical relation extraction. The authors find that its accuracy is typically below 50%, and propose structured output formatting (JSON) and domain adaptation techniques to improve evaluation accuracy by approximately 15%.

Browse all 31 Medical LLM papers →
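
The Romanian procedure-matching entry above reports its result as Acc@1, the share of queries whose correct entry is ranked first by the retriever. A minimal sketch of that metric follows; the procedure codes and rankings are invented placeholders, not data from the paper.

```python
def accuracy_at_k(ranked_candidates, gold_labels, k=1):
    """Fraction of queries whose gold label appears in the top-k candidates."""
    if len(ranked_candidates) != len(gold_labels):
        raise ValueError("need one gold label per ranked list")
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_candidates, gold_labels)
    )
    return hits / len(gold_labels)


# Hypothetical retrieval output: one ranked list of entry IDs per query.
retrieved = [
    ["P01", "P07", "P03"],
    ["P02", "P05", "P09"],
    ["P04", "P08", "P06"],
    ["P11", "P10", "P12"],
]
gold = ["P01", "P05", "P06", "P13"]

acc_at_1 = accuracy_at_k(retrieved, gold, k=1)
acc_at_3 = accuracy_at_k(retrieved, gold, k=3)
```

Framing the task as retrieval rather than classification means a rarely seen entry only has to be ranked near the top, which is why Acc@k is a natural fit for long-tail label sets.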

🧬 Computational Biology (6)

Align-Pro: Align Protein Representations Through Multi-Modal Learning: Align-Pro aligns the representations of three modalities of proteins—sequence, structure, and functional description—into a unified embedding space through a multi-modal contrastive learning framework, thereby enabling cross-modal protein retrieval, classification, and function prediction.
Concept Bottleneck Language Models For Protein Design: This paper introduces the explainability design principles of Concept Bottleneck Models (CBMs) into protein language models. By utilizing biological concepts in the intermediate layer as a bottleneck, the proposed method achieves a protein generation system that can design functional protein sequences while simultaneously providing human-understandable design rationales.
A Survey on Foundation Language Models for Single-cell Biology: This is the first systematic survey of foundation language models for single-cell biology from a language modeling perspective. It categorizes existing works into two major groups: PLMs (pre-trained from scratch) and LLMs (leveraging existing large models). The paper comprehensively analyzes tokenization strategies, pre-training/fine-tuning paradigms, and downstream task systems, while highlighting key challenges in data quality, unified evaluation, and scaling laws.
Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization: This paper proposes the KPO framework, which constructs a Protein Safety Knowledge Graph (PSKG) combined with a weighted graph pruning strategy to identify "similar but safe" protein pairs, and fine-tunes protein language models using DPO to steer them away from the hazardous sequence space while maintaining functionality.
LADDER: Language Driven Slice Discovery and Error Rectification in Vision Classifiers: LADDER "translates" the internal activations of pre-trained vision classifiers into natural language, retrieves error-related sentences, and leverages LLMs to reason out testable hypotheses regarding "which missing attributes cause the model to fail." This enables the discovery and mitigation of multiple biases in any off-the-shelf classifier without requiring any attribute annotations. It consistently outperforms baselines like Domino, Facts, and DFR across 6 natural/medical datasets and over 200 classifiers.
Retrieve to Explain: Evidence-driven Predictions for Explainable Drug Target Identification: Proposes R2E (Retrieve to Explain), a retrieval-based framework that scores and ranks candidate answers by retrieving evidence from a literature corpus and faithfully attributes predictions to supporting evidence using Shapley values, outperforming genetics and GPT-4 baselines in drug target identification tasks.

👥 Social Computing (28)

A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models: This paper proposes a paradigm shift from passive detection to proactive defense, constructing a "three-pillar" framework of knowledge credibility, inference reliability, and input robustness. It systematically maps 127 defense techniques into these three pillars. A meta-analysis of 48 benchmark studies shows that proactive defense improves performance by 42–63% compared to traditional approaches, while identifying non-trivial trade-offs in computational overhead and cross-domain generalization.
BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla: This paper introduces BanStereoSet, a Bangla stereotypical bias dataset comprising 1,194 fill-in-the-blank instances covering 9 bias categories (including race, gender, religion, profession, physical appearance, age, caste, and region). It evaluates social biases in multilingual LLMs for Bangla, revealing that GPT-4o exhibits the highest bias while Mistral displays the lowest.
Beyond Negative Stereotypes -- Non-Negative Abusive Utterances about Identity Groups and Their Semantic Variants: This paper investigates a neglected type of hate speech—abusive expressions that target identity groups without containing explicit negative stereotypes. It systematically analyzes the semantic variants of such "non-negative abusive utterances" and evaluates the processing capabilities of existing detection models.
BiasGuard: A Reasoning-Enhanced Bias Detection Tool for Large Language Models: BiasGuard is proposed to detect LLM output bias by explicitly reasoning about fairness specifications. In the first stage, a teacher model generates reasoning trajectories for SFT initialization; in the second stage, DPO is utilized to enhance reasoning quality. The method outperforms classifiers and LLM-as-the-Judge approaches across 5 datasets while reducing over-fairness false positives.
Can Community Notes Replace Professional Fact-Checkers?: A large-scale analysis of 664k Twitter/X Community Notes reveals that their reliance on professional fact-checking is 5 times higher than previously reported ($\ge$5-7%). Content involving conspiracy theories/false narratives is twice as likely to cite fact-checking sources compared to other content, demonstrating that high-quality community moderation is deeply intertwined with professional fact-checking and cannot replace it.
Conspiracy Theories and Where to Find Them on TikTok: The first systematic analysis of conspiracy theories on TikTok: collecting 1.5 million US long videos via the official API, identifying conspiracy theory content using hashtag enrichment and distant supervision (around 1,000 new videos per month), evaluating the impact of the TikTok Creator Rewards Program, and testing the effectiveness of open-source LLMs (Llama3, Mistral, Gemma) in detecting conspiracy theories based on audio transcriptions (achieving a precision up to 96% but overall performance comparable to fine-tuned RoBERTa).
Culture Matters in Toxic Language Detection in Persian: This paper systematically compares the performance of various methods (fine-tuning, data augmentation, zero/few-shot learning, cross-lingual transfer learning) in Persian toxic language detection, revealing that cultural similarity is a key factor determining the success of cross-lingual transfer learning—language data from culturally similar countries yields better transfer results.
Detection of Human and Machine-Authored Fake News in Urdu: This paper proposes a 4-way fake news detection task for Urdu (Human Fake / Human True / Machine Fake / Machine True), constructs the first Urdu machine-generated news dataset, and introduces a hierarchical detection framework that decomposes the 4-way classification into two sub-tasks: machine-generated text detection and fake news detection. It consistently outperforms baselines in both in-domain and cross-domain settings.
Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection: Drawing on the Implicit Association Test (IAT) and Self-Report Assessment (SRA) from social psychology, this paper proposes a self-reflection evaluation framework to systematically study the explicit and implicit biases of LLMs. It finds that LLMs, similar to humans, exhibit an inconsistency between explicit and implicit biases—mild explicit bias but strong implicit bias—and this inconsistency becomes more severe with larger model sizes and more alignment training.
Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language: This paper constructs five gender bias evaluation datasets specifically for German and systematically evaluates them across eight multilingual LLMs, revealing unique gender bias challenges in German—including the ambiguous interpretation of masculine occupational nouns and the influence of seemingly neutral nouns on gender perception.

Browse all 28 Social Computing papers →

🛡️ AI Safety (14)¶

Building a Long Text Privacy Policy Corpus with Multi-Class Labels: This paper constructs a multi-dimensional annotated corpus (64 annotation dimensions) containing the privacy policies of 149 companies, covering contentious clauses and legal rules in EU and US privacy regulations, and establishes classification benchmarks using current large language models (LLMs).
CENTAUR: Bridging the Impossible Trinity of Privacy, Efficiency, and Performance in Privacy-Preserving Transformer Inference: This paper proposes the Centaur framework, which integrates random permutation matrices and Secure Multi-Party Computation (SMPC) to break the "impossible trinity" in Privacy-Preserving Transformer Inference (PPTI)—simultaneously achieving strong privacy protection, 5-30x speedup, and plaintext-level inference accuracy.
Crafting Privacy-Preserving Adversarial Examples: A Defense Against Membership Inference: This paper proposes a method to defend against Membership Inference Attacks (MIA) by constructing privacy-preserving adversarial examples. It injects carefully designed perturbations into the model's prediction outputs, preventing attackers from determining whether a specific data point belongs to the training set, while maintaining service quality for normal users.
FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes: This paper proposes Indic-Bias, the first large-scale LLM fairness benchmark tailored to the diverse Indian society. Testing 14 LLMs across three evaluation tasks using 20,000 human-verified scenario templates, it reveals that models possess severe negative biases against marginalized groups such as Dalits and reinforce stereotypes in over 70% of the cases.
Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework: This work proposes GIFI (Gender Inclusivity Fairness Index), a multi-level evaluation framework covering seven dimensions: pronoun recognition, sentiment neutrality, toxicity, counterfactual fairness, stereotype association, occupational fairness, and mathematical reasoning consistency. It systematically quantifies binary and non-binary gender fairness across 22 mainstream LLMs, revealing deep bias patterns such as the complete absence of neopronouns without prompting and the over-correction of "she".
Multi-task Adversarial Attacks against Black-box Model with Few-shot Queries: This paper proposes CEMA (Cluster and Ensemble Multi-task Text Adversarial Attack), which transforms complex multi-task black-box attacks into single-task text classification attacks by training a "deep-level surrogate model." CEMA can simultaneously attack multiple downstream tasks (such as classification, translation, summarization, and text-to-image generation) with only about 100 queries. Its effectiveness is validated on commercial models, including ChatGPT-4o, Baidu Translate, and Stable Diffusion.
PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance: Proposes PrivaCI-Bench, the largest contextual privacy evaluation benchmark to date (154K instances) built upon Contextual Integrity theory. It covers real court cases, privacy policies, and synthetic data from EU AI Act compliance checkers to evaluate the legal compliance capabilities of LLMs under HIPAA, GDPR, and the AI Act.
Quantifying Misattribution Unfairness in Authorship Attribution: This paper proposes the $\text{MAUI}_k$ metric to quantify "misattribution unfairness" in authorship attribution systems—where certain authors are systematically more likely to be falsely identified as suspect authors. The study reveals that this unfairness is highly correlated with the distance of the author's embedding to the centroid in the vector space.
Robust and Minimally Invasive Watermarking for EaaS: Proposes ESpeW (Embedding-Specific Watermark), an embedding-specific watermarking method that injects unique watermarks at different positions of each embedding vector, achieving robust copyright protection for Embeddings as a Service (EaaS). It resists various watermark removal attacks while affecting the embedding quality by less than 1%.
Sandcastles in the Storm: Revisiting Watermarking Impossibility: This work challenges the theoretical impossibility results of "Watermarks in the Sand" (WITS) through large-scale experiments and human evaluation. It demonstrates that the two key assumptions of random walk attacks do not hold in practice: mixing is extremely slow (100% of attacked texts can still be traced back to their original source) and quality oracles are unreliable (only 77% accuracy), resulting in an automatic attack success rate of only 26%, which further drops to 10% after human quality auditing.

Browse all 14 AI Safety papers →

📂 Others (184)¶

Barec: A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment: This work constructs Barec—the first large-scale, balanced, and fine-grained Arabic readability assessment corpus containing over 69K sentences, 1M words, and 19 grading levels, annotated by 6 professional educators. It benchmarks 4 Arabic BERT models × 4 input variants × 5 loss functions, revealing that the morphological tokenization input D3Tok combined with regression loss achieves a QWK of 84.0%.
A Little Human Data Goes A Long Way: Large-scale experiments on 8 fact verification and QA datasets demonstrate that mixing a very small amount of human-annotated data (even as few as 125 samples) into synthetic data significantly improves model performance. Replacing the final 10% of human data with synthetic data leads to severe performance degradation, and the performance gain from just 200 human samples requires orders of magnitude more synthetic data to match.
A Measure of the System Dependence of Automated Metrics: Points out the ignored "system dependence" issue in machine translation automated evaluation metrics: the same metric score corresponds to different human ratings for different translation systems. The paper proposes the SysDep metric to quantify this effect, revealing that even the best WMT23 metric, XCOMET, exhibits severe system dependence that leads to incorrect rankings.
A Multi-Persona Framework for Argument Quality Assessment: This paper proposes the MPAQ framework, which simulates multiple distinct evaluator perspectives (personas) using Large Language Models to conduct multi-aspect quality assessment of arguments. It designs a coarse-to-fine scoring strategy (first integer, then decimal), significantly outperforming existing baselines on the IBM-Rank-30k and IBM-ArgQ-5.3k datasets while providing interpretable multi-perspective explanations.
A New Formulation of Zipf's Meaning-Frequency Law through Contextual Diversity: This paper proposes to reformulate Zipf's meaning-frequency law as a power-law relationship between word frequency and contextual diversity. It quantifies the number of word meanings through the directional distribution of contextualized word vectors generated by language models. The findings reveal that this law is unobservable in small-scale language models, and autoregressive LMs require significantly more parameters than masked LMs to exhibit the law.
A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs: A directed acyclic graph (DAG) based workflow framework is proposed. By decomposing the complex business constraints of an LLM agent into different state nodes in the graph and combining this with a response masking fine-tuning strategy, a production-grade e-commerce conversational agent is built. It significantly outperforms the GPT-4o baseline in both task accuracy and format adherence.
A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior: This paper proposes a unified probabilistic model of reading behavior based on marked spatio-temporal point processes. It simultaneously models when and where fixations occur and how long they last, avoiding the information loss associated with traditional aggregated measures, and reveals that surprisal has an extremely limited contribution to predicting fine-grained eye movements.
ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Clause Retrieval: Builds the first expert-annotated clause retrieval benchmark for contract drafting, ACORD (114 queries, 126K+ pairs, 1-5 star ratings). Evaluating 20 retrieval methods reveals that BM25 + GPT-4o pointwise reranking performs best (NDCG@5 = 76.9%), but the accuracy for high-quality clauses is extremely low (5-star precision@5 is only 17.2%), highlighting a significant gap between models and the actual needs of lawyers.
Adaptive Feature-based Low Rank Plus Sparse Decomposition for Subspace Clustering: This paper proposes an adaptive feature-driven low-rank plus sparse matrix decomposition method. By adaptively learning the weights of low-rank and sparse components in the feature space, it addresses the issues of noise robustness and insufficient feature discriminability in subspace clustering.
Adaptive Retrieval without Self-Knowledge? Bringing Uncertainty Back Home: This work conducts a comprehensive evaluation of 35 adaptive retrieval methods (including 8 state-of-the-art methods and 27 uncertainty estimation methods), revealing that classic uncertainty estimation techniques often outperform complex, specialized pipelines in terms of efficiency and self-knowledge capability, while maintaining comparable QA performance.
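To make the uncertainty-based idea in the entry above concrete, here is a minimal sketch of one classic signal: retrieve only when the mean entropy of the model's next-token distributions is high. The function names, the toy distributions, and the threshold of 0.5 nats are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions: List[Sequence[float]]) -> float:
    """Average per-token entropy over all generation steps of an answer."""
    if not step_distributions:
        raise ValueError("need at least one token distribution")
    total = sum(token_entropy(d) for d in step_distributions)
    return total / len(step_distributions)


def should_retrieve(step_distributions: List[Sequence[float]],
                    threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: call the retriever only when the model
    looks uncertain. The threshold is an assumed value that would need
    tuning on held-out data."""
    return mean_entropy(step_distributions) > threshold


# Toy per-step distributions over a 4-token vocabulary (made up for illustration).
confident = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

print(should_retrieve(confident), should_retrieve(uncertain))
```

A rule like this costs nothing beyond the probabilities the model already produces, which is one plausible reason simple uncertainty estimates can compete with heavier pipelines on efficiency.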

Browse all 184 Others papers →

🗂 More Areas (25)¶

🧊 3D Vision (1)¶

Slamming: Training a Speech Language Model on One GPU in a Day: This paper proposes the Slam training recipe, which systematically optimizes model initialization, architectural choices, synthetic data, and preference alignment to train a speech language model on a single A5000 GPU within 24 hours, achieving performance comparable to large-scale SLMs.

🎯 Object Detection (2)¶

Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions: This work provides the first mechanistic analysis of "anchored bias" (consistently choosing "A") in the GPT-2 family within multiple-choice questions (MCQs) from a failure-case perspective. It localizes specific value vectors storing the "A" preference in MLPs using Logit Lens, and achieves an average MCQ accuracy improvement of 70%+ through minimal intervention (updating the value vectors).
Weed Out, Then Harvest: Dual Low-Rank Adaptation is an Effective Noisy Label Detector for Noise-Robust Learning: This paper proposes the Delora framework, which constructs a noisy label detector by introducing clean LoRA and noisy LoRA modules. By decoupling sample selection from model training, Delora breaks the vicious catch-22 cycle of mutual influence between sample selection and training in traditional "small-loss" approaches.

✂️ Segmentation (4)¶

BERT-like Models for Slavic Morpheme Segmentation: This paper explores the use of BERT-like pretrained language models for morpheme segmentation tasks in Slavic languages. By modeling morpheme segmentation as a sequence labeling problem, the approach achieves results superior to traditional methods across multiple Slavic languages.
DEF-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation: Proposes DEF-DTS, a dialogue topic segmentation method based on LLM multi-step deductive reasoning. Through a three-step pipeline of bidirectional context summarization $\rightarrow$ utterance intent classification (5 classes) $\rightarrow$ deductive topic shift judgment, it achieves unsupervised/prompt-based SOTA on three datasets: TIAGE, SuperDialseg, and Dialseg711, outperforming supervised methods on Dialseg711.
InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning: This work proposes InstructPart, the first real-world benchmark that combines task-oriented instructions with part-level segmentation, comprising 2,400 images, 48 object categories, 44 part categories, and 9,600 human-annotated task instructions. Evaluation shows that current VLMs are severely inadequate in instruction-driven part segmentation, while a baseline fine-tuned on LISA+DINOv2 achieves an approximate 100% performance gain.
Pixel-Level Reasoning Segmentation via Multi-turn Conversations: Proposes a new task of pixel-level reasoning segmentation (Pixel-level RS) to achieve fine-grained segmentation by progressively understanding user intent through multi-turn conversations. A PRIST dataset containing 24k dialogue turns is constructed, and a MIRAS framework is designed to outperform existing baselines in both segmentation accuracy and reasoning capability.

🖼️ Image Restoration (3)¶

A Self-Denoising Model for Robust Few-Shot Relation Extraction: This paper proposes a Self-Denoising Model (SDM) to address the issue of support set label noise in few-shot relation extraction. Through the co-training of a label correction module and a relation classification module, SDM automatically corrects noisy labels and achieves more robust relation prediction, significantly outperforming baselines even in noise-free scenarios.
DiffuseDef: Improved Robustness to Adversarial Attacks via Iterative Denoising: DiffuseDef inserts a diffusion denoiser layer between the encoder and the classifier. During training, it learns to predict noise in hidden states. During inference, it adds noise to the hidden representations, iteratively denoises them, and performs ensemble averaging. This plug-and-play approach significantly enhances the robustness of text classification models under both black-box and white-box adversarial attacks.
PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy: Proposes the PreP-OCR two-stage pipeline: first restoring historical document images using a ResShift model trained on synthetically degraded data (employing multi-directional patch extraction and median fusion), and then applying ByT5 for semantic post-OCR error correction. It reduces CER by 63.9-70.3% across 13,831 pages of real-world historical documents.

🧑 Human Understanding (2)¶

Beyond Surface Simplicity: Revealing Hidden Reasoning Attributes for Precise Commonsense Diagnosis: This paper reveals that commonsense reasoning benchmarks contain questions that appear simple on the surface but actually involve complex hidden reasoning attributes. It proposes a fine-grained diagnostic framework based on these attributes, enabling a more precise analysis and evaluation of models' commonsense reasoning capabilities.
TransBench: Breaking Barriers for Transferable Graphical User Interface Agents in Dynamic Digital Environments: This paper proposes TransBench, the first benchmark to systematically evaluate the transferability (cross-version/cross-platform/cross-app) of GUI agents. It covers 81 Chinese apps, 1,459 screenshots, and 22K+ annotated instructions. Experiments show that fine-tuning on older versions can effectively transfer to new versions and other platforms, with Android data exhibiting the strongest generalization in cross-platform migration.

📹 Video Understanding (8)¶

A Thousand Words Paint a Picture: Multimodal Goal Tracking for Grounded Social Intelligence: This paper proposes a multimodal goal tracking framework that reasons about the implicit goals of participants in social situations by integrating visual and linguistic cues, thereby enhancing the model's understanding of social contexts (i.e., "grounded social intelligence").
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models: This study presents the first systematic exploration of selection bias in multiple-choice question answering (MCQA) with Video Language Models (VLMs). By analyzing bias sources through task decomposition, it proposes BOLD, a post-processing calibration technique that reduces bias while simultaneously improving model performance.
Attention-Seeker: Dynamic Self-Attention Scoring for Unsupervised Key-Frame Extraction: This paper proposes Attention-Seeker, an unsupervised method that dynamically analyzes the attention score distribution in self-attention layers of Transformer models to extract the most representative key-frames from videos without any supervision signals. It outperforms existing unsupervised methods on multiple video summarization benchmarks.
From Teacher to Student: Tracking Memorization Through Model Distillation: This work systematically investigates the impact of knowledge distillation (KD) on the memorization behavior of large language models, finding that distillation not only compresses the model but also significantly reduces the risk of verbatim memorization of training data—with reverse KL distillation (RKLD/MiniLLM) reducing the memorization ratio from 65.4% in SFT to as low as 6.0%.
Generative Frame Sampler for Long Video Understanding: Proposes GenS, a generative frame sampling module based on VideoLLM. It outputs question-aware relevant frame intervals and confidence scores in natural language format. As a plug-and-play module, it consistently improves multiple VideoLLMs by 2-4 points on LongVideoBench, MLVU, and HourVideo.
Improving Dialogue State Tracking through Combinatorial Search for In-Context Examples: This paper proposes CombiSearch, a method that employs combinatorial scoring to select the optimal combination of in-context examples for Dialogue State Tracking (DST). Using only 5% of the training data, it outperforms all baselines trained on 100% of the data. Under oracle settings, its Joint Goal Accuracy (JGA) upper bound is 12% higher than traditional methods.
RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning: This paper proposes the RAVEN framework, which integrates curriculum reinforcement learning with multimodal LLMs. Through hierarchical reward mechanisms and progressive training strategies, RAVEN achieves precise temporal grounding and category prediction of advertisement video violations, unlocking emergent reasoning capabilities without requiring explicit reasoning annotation data.
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs: Based on the observation of attention score sparsity in Video-LLMs, this paper proposes the Sparse-to-Dense (StD) decoding strategy. It uses a top-K sparse attention model as a draft model to rapidly generate candidate tokens, which are then verified in parallel by a dense model, achieving up to a 1.94× lossless acceleration without requiring additional training or architectural modifications.
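The draft-then-verify pattern behind the last entry above can be sketched with toy stand-ins. This is a generic greedy speculative-decoding loop under stated assumptions, not the paper's implementation: `toy_dense` and `toy_draft` are hypothetical functions that replace the dense model and the top-K sparse-attention draft, and verification runs sequentially here, whereas a real system would score all drafted positions in one parallel forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain token-by-token greedy decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_greedy_decode(dense: NextToken, draft: NextToken,
                              prompt: List[int], max_new: int,
                              k: int = 4) -> List[int]:
    """Draft k tokens cheaply, keep the prefix the dense model agrees with,
    and replace the first disagreement with the dense model's own token."""
    out = list(prompt)
    target_len = len(prompt) + max_new
    while len(out) < target_len:
        # 1) The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The dense model verifies each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = dense(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # correct the mismatch, then stop
                break
        out.extend(accepted)
    return out[:target_len]


def toy_dense(ctx: List[int]) -> int:
    """Hypothetical deterministic 'dense model' over a 17-token vocabulary."""
    return (3 * ctx[-1] + len(ctx)) % 17


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft: agrees with the dense model except at every 4th length."""
    tok = toy_dense(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 17


reference = greedy_decode(toy_dense, [1, 2, 3], 12)
fast = speculative_greedy_decode(toy_dense, toy_draft, [1, 2, 3], 12)
print(fast == reference)
```

Every token that is kept equals what the dense model would have produced for the same prefix, so the result matches plain greedy decoding exactly. That is the sense in which this style of acceleration is lossless. Any speedup depends on how often the draft is accepted.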

🚗 Autonomous Driving (1)¶

Embracing Large Language Models in Traffic Flow Forecasting: Proposes the LEAF framework, which utilizes a dual-branch predictor comprising a graph branch (for pair-wise relations) and a hypergraph branch (for non-pair-wise relations) to generate candidate forecasts. A frozen LLM is then employed as a selector (used discriminatively rather than generatively) to choose the optimal forecast, and its selection is fed back to optimize the predictor via a ranking loss. The framework achieves SOTA on PEMS datasets.

📐 Optimization & Theory (3)¶

Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race: Reveals the "race-blindness" side-effect of alignment training: Alignment prevents LLMs from representing "black/white" as racial concepts in ambiguous contexts, thus failing to activate safety guardrails and causing implicit bias to surge from 64.1% to 91.4%. Counter-intuitively, injecting race-aware activations (rather than unlearning) in early layers reduces implicit bias from 97.3% to 42.4%.
AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment: Proposes AmbiK, a text-only dataset dedicated to detecting ambiguous instructions in kitchen environments. It contains 1,000 pairs of ambiguous/unambiguous instructions categorized by three ambiguity types (user preference, common sense, and safety). Multiple conformal prediction-based ambiguity detection methods are evaluated, revealing that existing methods perform poorly on this benchmark.
ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting: ScaleBiO proposes a fully first-order bilevel optimization algorithm based on penalty function reformulation, applying bilevel optimization to data source reweighting for 30B+ parameter LLMs for the first time, achieving improvements of +9% on GSM8K and +5.8% on MATH for Qwen-2.5-32B.

📡 Signal & Communications (1)¶

WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications: This paper presents WirelessMathBench, a mathematical modeling benchmark for wireless communications featuring 587 problems extracted from 40 cutting-edge papers. It systematically evaluates the capabilities of LLMs in domain-specific mathematical derivations, revealing that even the strongest model, DeepSeek-R1, achieves an average accuracy of only 38.05%, and a mere 7.83% in full formula derivation.