
💬 LLM (Other)

💬 ACL2025 · 442 paper notes

📌 Same area in other venues: 📷 CVPR2026 (3) · 🔬 ICLR2026 (56) · 💬 ACL2026 (62) · 🧪 ICML2026 (39) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (54)

🔥 Top topics: LLM ×174 · Reasoning ×34 · Alignment/RLHF ×18 · Agents ×16 · Adversarial Robustness ×14

Towards Robust ESG Analysis Against Greenwashing Risks: A3CG: This work proposes the A3CG dataset and the Aspect-Action Analysis task (extracting aspects and their action types from sustainability claims: Implemented, Planning, or Indeterminate) to evaluate the robustness of NLP methods against greenwashing risks under cross-category generalization settings. It finds that supervised learning methods (GRACE F1 = 47.51) outperform LLMs (Claude 3.5 F1 = 42.03) but exhibit worse generalization efficiency.
A Large-Scale Real-World Evaluation of an LLM-Based Virtual Teaching Assistant: A RAG-based LLM Virtual Teaching Assistant (VTA) was deployed in a graduate-level AI programming course with 477 students at KAIST. Through longitudinal analysis of three rounds of surveys (472 respondents) and 3869 interaction logs, the study revealed that the VTA significantly reduced students' psychological barriers to asking questions. While satisfaction among high-frequency users continuously improved over time, trust in the VTA remained lower than in human TAs.
A Modular Dataset to Demonstrate LLM Abstraction Capability: This paper proposes the ArrangementPuzzle dataset and trains LLM activation classifiers, finding that the classifiers identify reasoning correctness with >80% accuracy. This reveals that LLMs encode abstract reasoning concepts distinguishing logical equivalence from semantic equivalence in middle-to-late Transformer layers.
A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models: By analyzing the transition traces of latent representations during LLM inference to compute the semantic deviation of each layer, combined with a derived scaling law formula to estimate each layer's contribution to reducing loss, this paper determines "which layers to fine-tune," achieving an efficient fine-tuning approach that is orthogonal to PEFT.
SSUF: A Semi-supervised Scalable Unified Framework for E-commerce Query Classification: A unified framework for e-commerce query classification, SSUF, is proposed. It utilizes three pluggable modules—Label Enhancement (BERT semantic label encoding), Knowledge Enhancement (LLM world knowledge + posterior clicks + semi-supervised label generation), and Structure Enhancement (co-occurrence/semantic/hierarchical multi-graph fusion GCN)—to address insufficient information in short queries and the vicious cycle of the Matthew effect. SSUF achieves Macro F1 scores of 49.46 and 41.22 on JD.COM intent and category classification tasks, respectively (outperforming SOTAs like SMGCN), and has been deployed online, bringing significant commercial value.
A Survey of Automatic Prompt Optimization with Instruction-focused Heuristic-based Search Algorithm: This paper presents a systematic survey of over 80 automatic prompt optimization (APO) methods based on heuristic search algorithms, proposing a five-dimensional taxonomy (Where/What/What criteria/Which operators/Which algorithms) to unify fragmented research into a comprehensive analytical framework.
A Survey of LLM-based Agents in Medicine: How Far Are We from Baymax?: This paper systematically reviews the four-layer architecture (Profile, Clinical Planning, Medical Reasoning, and External Capacity Enhancement), four major application scenarios, and evaluation frameworks of LLM-based agents in medicine. Covering 60 studies from 2022 to 2024, it proposes four agent operational paradigms and identifies key challenges such as hallucination management, multimodal integration, and ethical concerns.
A Survey on Efficient Large Language Model Training: From Data-centric Perspectives: This paper presents the first systematic survey framework for "data-efficient LLM post-training", categorizing methods into five major areas: data selection, data quality enhancement, synthetic data generation, data distillation & compression, and self-evolving data ecosystems, thereby constructing a comprehensive "data value flywheel" system.
A Systematic Study of Compositional Syntactic Transformer Language Models: This paper proposes a unified framework to systematically study four key design dimensions of compositional syntactic Transformer language models (SLMs): tree format, linearization strategy, composition function, and sub-constituent masking. Covering existing models and 13 new variants, this work provides multiple design recommendations for SLMs through comprehensive evaluations across five dimensions: language modeling, syntactic generalization, summarization, dialogue, and inference efficiency.
A Training-free LLM-based Approach to General Chinese Character Error Correction: This paper proposes the Chinese Character Error Correction (C2EC) task, which covers substitution, missing, and redundant error types. By extending a training-free CSC method with Levenshtein distance and a prompt-based LLM, the proposed approach enables a 14B-parameter model, without direct fine-tuning, to perform on par with models up to 50 times larger.
AAD-LLM: Neural Attention-Driven Auditory Scene Understanding: This paper proposes the Intention-aware Auditory Scene Understanding (II-ASU) paradigm and the AAD-LLM prototype system. By decoding which speaker the listener is attending to from intracranial electroencephalography (iEEG) signals and injecting this attention state into an auditory LLM, the model generates responses aligned with the listener's perception in multi-speaker scenarios.
Achieving Certification-by-Design Through Model-Driven Development: This paper proposes a "Certification-by-Design" model-driven development approach. By embedding safety certification requirements directly into the design phase of NLP systems, the final system automatically complies with relevant industry standards and regulations, reducing the high costs of post-hoc certification.
Acquisition and Application of Novel Knowledge in Large Language Models: This paper proposes the PermAR framework, which endows autoregressive models with bidirectional knowledge acquisition capabilities through permuted language modeling. It also constructs NovelHuman, a new knowledge dataset based on simulating biological evolution over knowledge graphs. The authors find that the position of knowledge within a sentence significantly affects the knowledge acquisition performance of LLMs. On new knowledge injection tasks, PermAR improves performance by 3.3%–38% compared to existing methods.
ACT: Knowledgeable Agents to Design and Perform Complex Tasks: This paper proposes the ACT framework, where multiple LLM agents collaboratively design tasks and acquire structured knowledge descriptions of each other. This allows each agent to not only grasp its own task but also understand how others operate, significantly outperforming existing multi-agent methods in complex tasks such as creative writing and tool use.
AfroBench: How Good are Large Language Models on African Languages?: This paper proposes AfroBench, a comprehensive evaluation benchmark covering 64 African languages, 15 NLP tasks, and 22 datasets. Evaluating 12 LLMs reveals that the closed-source model (GPT-4o) outperforms the best open-source model (Gemma 2 27B) by approximately 12 points, yet all LLMs still lag behind fine-tuned baselines, and the performance gap with English exceeds 40 points on open-source models.
AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration: This paper proposes AgentDropout, which optimizes the communication topology of multi-agent systems by dynamically eliminating redundant agent nodes and communication edges in each communication round. Compared to state-of-the-art methods, it reduces prompt token consumption by an average of 21.6% and completion token consumption by 18.4%, while improving performance by 1.14 points.
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments: This paper proposes the AgentGym framework, which features 14 interactive environments, 89 task classes, standardized trajectory datasets, and evaluation benchmarks. It also introduces the AgentEvol self-evolution algorithm, enabling LLM agents to transition from imitation to autonomous evolution through cross-environment exploration and learning, achieving performance comparable to state-of-the-art models.
AI as a Novel Ethical Agent: Exploring Moral Judgments by Large Language Models: This paper systematically explores the moral judgment capabilities of large language models (LLMs) as novel ethical agents. By constructing an evaluation benchmark covering multiple ethical frameworks, it reveals the preference patterns, consistency flaws, and cultural biases of LLMs in moral reasoning.
AIMSCheck: Leveraging LLMs for AI-Assisted Review of Modern Slavery Statements Across Jurisdictions: This work proposes AIMSCheck—an end-to-end framework for corporate modern slavery statement compliance assessment. It decomposes the evaluation task into three tiers: sentence-level multi-label classification, token-level SHAP explanation, and evidence status tracking. It also constructs two new annotated datasets, AIMS.uk and AIMS.ca, validating that models fine-tuned on Australian data can effectively generalize across jurisdictions.
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study: Based on open-ended question data from the German Longitudinal Election Study (GLES), this study systematically evaluates the algorithmic fidelity of three open-source LLMs (Llama2, Gemma, Mixtral) in generating synthetic German public opinions using demographic persona prompting. The findings show that Llama2 performs best in sub-population representativeness (JS distance of 0.28), yet all models exhibit a left-leaning political bias and a reduction in within-group diversity.
Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring: Experiments that simulate teacher-student conversations with LLMs show that system prompting based on CEFR levels can initially constrain the difficulty of the generated Spanish text, but this constraint gradually decays as the conversation progresses. The authors term this phenomenon "alignment drift," indicating that prompt engineering alone is insufficient for sustaining long-term adaptive language education systems.
An Empirical Study of Iterative Refinements for Non-Autoregressive Translation: This paper conducts a systematic empirical study on iterative refinement methods in Non-Autoregressive Translation (NAT). It compares the trade-offs between translation quality and inference speed across different refinement strategies (such as CMLM, DisCo, SUNDAE, etc.), reveals the critical impacts of iteration counts, mask ratios, and training strategies on the final performance, and provides comprehensive practical guidance for NAT research.
An Empirical Study of Large Language Models for Automated Review Generation: This paper presents a systematic empirical study evaluating the capabilities of various large language models (LLMs) in automatically generating peer reviews for academic papers. It analyzes the quality, consistency, and utility of the generated reviews, uncovering the strengths, weaknesses, and directions for improvement of current LLMs in the task of review generation.
Analyzing and Mitigating Inconsistency in Discrete Speech Tokens for Neural Codec Language Models: This paper reveals the phenomenon of Discrete Representation Inconsistency (DRI) in neural audio codecs (such as EnCodec)—where the same audio segment is encoded into different token sequences depending on the presence of context. It proposes two constraint methods: slice consistency and perturbation consistency, improving representation consistency by 21-36%, which leads to a 3.72% reduction in Word Error Rate (WER) and a 5.68% improvement in speaker similarity in VALL-E speech generation.
Analyzing the Rapid Generalization of SFT via the Perspective of Attention Head Activation Patterns: Through a gradient-based analysis of attention head activation patterns, this paper reveals three key mechanisms by which SFT enables LLMs to rapidly adapt to downstream tasks: SFT selectively activates task-specific attention heads, the activation patterns of complex tasks are linear combinations of those of basic tasks, and a small number of samples can significantly alter activation patterns. It also proposes a practical strategy that leverages basic task data to facilitate the learning of complex tasks.
APPL: A Prompt Programming Language for Harmonious Integration of Programs and Large Language Model Prompts: This paper proposes APPL, a prompt programming language that seamlessly embeds LLM prompts into Python programs. It provides Python-native syntax, an asynchronous parallel runtime, and a traceable debugging module, simplifying the development and maintenance of complex LLM workflows.
Revisiting Common Assumptions about Arabic Dialects in NLP: This work systematically examines four widely accepted but unvalidated assumptions in Arabic dialect NLP. By expanding the NADI 2024 dataset (covering 11 country-level dialects with 33 annotators), the study reveals that these assumptions oversimplify reality: 56% of dialectal sentences are valid across multiple regions, and ADI should be modeled as a multi-label classification task.
Are Optimal Algorithms Still Optimal? Rethinking Sorting in LLM-Based Pairwise Ranking with Batching and Caching: This paper re-examines the choice of sorting algorithms in LLM-based Pairwise Ranking Prompting (PRP). It proposes a core cost model based on the number of LLM inference calls rather than the number of comparisons. It is found that the classical optimal algorithm, Heapsort, is no longer optimal when batching and caching optimizations are introduced. Quicksort reduces the number of inference calls by 44% when the batch size $\ge 2$, providing a new optimal choice for PRP sorting.
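The cost model in the entry above can be illustrated with a minimal sketch. It assumes that a plain `<` comparison stands in for one LLM pairwise-ranking call, that up to `batch_size` independent comparisons share a single inference call, and that heapsort's comparisons are charged one call each because each depends on the previous result. The function names and the charging rule are illustrative assumptions, not the paper's implementation, and caching is not modeled.

```python
import math


def quicksort_batched(items, batch_size):
    """Return (sorted_items, inference_calls) under the batched cost model."""
    if len(items) <= 1:
        return list(items), 0
    pivot, rest = items[0], items[1:]
    # Every element is compared against the same pivot, so these comparisons
    # are independent and can be grouped into batches of `batch_size`.
    calls = math.ceil(len(rest) / batch_size)
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    sorted_left, calls_left = quicksort_batched(left, batch_size)
    sorted_right, calls_right = quicksort_batched(right, batch_size)
    return sorted_left + [pivot] + sorted_right, calls + calls_left + calls_right


def heapsort_sequential(items):
    """Return (sorted_items, inference_calls), charging one call per comparison."""
    heap = list(items)
    calls = 0

    def sift_down(start, end):
        nonlocal calls
        root = start
        while 2 * root + 1 < end:
            child = 2 * root + 1
            if child + 1 < end:
                calls += 1  # compare the two children
                if heap[child] < heap[child + 1]:
                    child += 1
            calls += 1  # compare the root with the larger child
            if heap[root] < heap[child]:
                heap[root], heap[child] = heap[child], heap[root]
                root = child
            else:
                return

    n = len(heap)
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n)
    for end in range(n - 1, 0, -1):
        heap[0], heap[end] = heap[end], heap[0]
        sift_down(0, end)
    return heap, calls
```

Under these assumptions, raising `batch_size` lowers the Quicksort call count while the Heapsort count stays fixed, which is the effect the entry describes qualitatively.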
Are Your LLMs Capable of Stable Reasoning?: Proposes the G-Pass@k evaluation metric and the LiveMathBench dynamic benchmark to comprehensively evaluate the reasoning capabilities of LLMs from two dimensions: "performance ceiling" and "stability", revealing that current LLMs still have significant room for improvement in reasoning consistency.
Argument Mining in the Age of Large Language Models: This paper systematically investigates the current status and challenges of Argument Mining (AM) tasks in the era of Large Language Models. Through comprehensive experiments, it evaluates the performance of LLMs on subtasks such as argument component identification, argument relation classification, and argument quality assessment, proposes targeted improvement strategies, and reveals the advantages and limitations of LLMs in structured argument understanding.
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving: This work proposes ArithmAttack, which evaluates the robustness of LLMs by randomly inserting punctuation marks into math problem contexts (without altering any words). It reveals that eight popular LLMs (including Llama3, Mistral, and DeepSeek) suffer significant performance degradation when facing such simple noise.
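A minimal sketch of this kind of perturbation follows. The punctuation set, the `noise_ratio` parameter, and the function name are assumptions made for illustration; the entry only states that punctuation is inserted at random and that no word is altered.

```python
import random

# Hypothetical punctuation inventory; the actual set used may differ.
PUNCTUATION = [".", ",", ";", ":", "!", "?"]


def insert_punctuation_noise(text, noise_ratio=0.3, seed=0):
    """Insert random punctuation tokens between words without changing any word."""
    rng = random.Random(seed)
    words = text.split()
    n_insertions = int(len(words) * noise_ratio)
    noisy = list(words)
    for _ in range(n_insertions):
        # Any gap is a valid slot, including before the first and after the last token.
        position = rng.randint(0, len(noisy))
        noisy.insert(position, rng.choice(PUNCTUATION))
    return " ".join(noisy)
```

Removing the standalone punctuation tokens from the output recovers the original word sequence, which is what keeps the attack purely a noise injection.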
ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution: Proposes the ASPERA framework and the Asper-Bench benchmark to evaluate the ability of LLMs to execute complex multi-step action planning (program generation) under the constraints of a custom assistant library, revealing that program generation based on custom API libraries poses a significantly greater challenge for LLMs compared to free-form code generation.
Assessing and Enhancing the Causal Reasoning Abilities of Language Models via Faithful Textual Interpretation: This paper proposes a framework based on Faithful Textual Interpretation (FTI), which evaluates and enhances the causal reasoning abilities of LLMs by faithfully translating variable relationships in causal reasoning tasks into natural language descriptions, achieving significant improvements across multiple causal reasoning benchmarks.
Assessing the Vulnerability of LLMs to Cognitive Biases in Scientific Research: This paper systematically evaluates the vulnerability of Large Language Models (LLMs) to various cognitive biases in scientific research scenarios. By constructing a scientific reasoning test suite covering confirmation bias, anchoring effect, availability bias, and others, the study reveals the risks of systemic biases that LLMs may introduce when assisting scientific research, and proposes mitigation strategies.
ATRIE: Automating Legal Interpretation with LLMs: Retrieval, Generation, and Evaluation: This work proposes the ATRIE framework, which simulates the doctrinal legal research workflow of legal experts. It leverages LLMs to automatically retrieve relevant information from case databases, generate interpretations of legal concepts, and evaluate interpretation quality, thereby eliminating the reliance on human legal experts.
Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models: This paper proposes Atlas (Attention-based Targeted Layer Analysis and Scaling), which localizes the layers where bias is concentrated in LLMs by analyzing attention scores, and then mitigates this bias through inference-time attention-scaling interventions on these target layers. This approach effectively reduces bias across three datasets (BBQ, CrowS-Pairs, and WinoGender) and four models, while incurring only a 0.82% increase in perplexity.
Attribution Methods in NLP: Navigating a Fragmented Landscape: This paper presents a comprehensive survey and systematic comparison of attribution methods in NLP. Addressing the issue of fragmented evaluation metrics and lack of fair comparisons in this field, it proposes a unified evaluation framework and reveals the applicability dynamics of different attribution methods across various tasks and model architectures.
AutoExp: Automatic Experiment Design and Execution by LLMs: This paper proposes the AutoExp framework, which leverages LLMs as intelligent agents to automatically complete the entire workflow of NLP experiments—from research question analysis, experimental design, and code generation/execution to result analysis and interpretation. It demonstrates the feasibility and limitations of LLM-based automated scientific experimentation across multiple standard NLP research scenarios.
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs: This work proposes the AutoGUI automatic annotation pipeline. By simulating interactions to compare UI state changes, inferring element functionality with LLMs, and performing dual-LLM verification and filtering, it constructs a high-quality UI functionality dataset of 704K annotations. The annotation accuracy of 96.7% is comparable to humans, significantly enhancing VLM UI grounding capabilities and demonstrating clear data scaling effects.
LLM-AT: Automatic Transmission for LLM Tiers Optimizing Cost and Accuracy: This paper proposes the LLM-AT framework, a training-free iterative pipeline consisting of Starter (an accuracy estimator based on historical inference logs that selects the initial LLM tier) $\rightarrow$ Generator (generates answers) $\rightarrow$ Judge (evaluates validity, automatically upgrading to a higher tier if invalid). On the MATH dataset, LLM-AT achieves near-o1 accuracy at only 59.37% of the cost of a single o1 inference, and achieves comparable performance on MCQA at only 12% of the o1 cost.
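The Starter, Generator, and Judge loop described above can be sketched roughly as follows. Everything here is a hypothetical stand-in: tiers are ordered from cheapest to most expensive, `estimated_accuracy` is passed in directly rather than derived from historical inference logs, and `threshold` is an invented parameter.

```python
def choose_start_tier(estimated_accuracy, threshold):
    """Starter: pick the cheapest tier whose estimated accuracy clears the threshold."""
    for tier, accuracy in enumerate(estimated_accuracy):
        if accuracy >= threshold:
            return tier
    # No tier looks good enough, so start at the top tier.
    return len(estimated_accuracy) - 1


def answer_with_escalation(question, generators, judge, estimated_accuracy, threshold=0.7):
    """Run Starter -> Generator -> Judge, moving up one tier after each invalid answer."""
    tier = choose_start_tier(estimated_accuracy, threshold)
    answer = None
    while tier < len(generators):
        answer = generators[tier](question)  # Generator: produce an answer
        if judge(question, answer):          # Judge: accept or reject it
            return answer, tier
        tier += 1
    # Every remaining tier was rejected; return the top tier's last answer.
    return answer, len(generators) - 1
```

For example, with three stub generators where only the top tier answers correctly and an estimate that starts at the middle tier, the call escalates once and returns the top tier's answer.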
Awes, Laws, and Flaws From Today's LLM Research: A 14-dimensional annotation and statistical analysis of 2,054 LLM research papers (2020–2024) citing GPT-3/GPT-4 reveals a systematic methodological degradation in the field—only 25% of papers contain statistical tests, the proportion of ethics statements continues to decline, and LLM-as-a-judge approaches have surged by 15% despite lacking meta-evaluation. Meanwhile, the study empirically validates that mandatory conference checklists (such as ACL's limitations requirements) have effectively curbed this decline.
AXIS: Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents: This paper proposes the AXIS framework, which enables LLM Agents to complete application tasks by prioritizing API calls over simulating human UI actions. In Microsoft Word experiments, AXIS reduces task completion time by 65-70% and cognitive load by 38-53%, while maintaining an accuracy rate of 97-98%.
OPTS: Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers: This work proposes OPTS, the first explicit selection mechanism for prompt design strategies, modeling 11 strategies (such as CoT, role-playing, and emotional prompts) as arms of a multi-armed bandit. By using Thompson sampling to automatically select the most suitable strategy and integrating it into the EvoPrompt optimizer, OPTS achieves up to a 50% performance improvement using GPT-4o mini across 23 tasks of BIG-Bench Hard.
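A minimal sketch of strategy selection with Thompson sampling is shown below. It assumes a Beta-Bernoulli bandit in which the reward is a binary "did the edited prompt improve the score" signal; that reward definition, the class name, and the uniform prior are illustrative assumptions rather than details taken from the paper.

```python
import random


class ThompsonStrategySelector:
    """Beta-Bernoulli Thompson sampling over prompt design strategies."""

    def __init__(self, strategies, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior for every arm (an assumed, uninformative prior).
        self.successes = {s: 1 for s in strategies}
        self.failures = {s: 1 for s in strategies}

    def select(self):
        """Sample a plausible success rate per arm and pick the best sample."""
        samples = {
            s: self.rng.betavariate(self.successes[s], self.failures[s])
            for s in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, strategy, improved):
        """Record whether applying the strategy improved the prompt."""
        if improved:
            self.successes[strategy] += 1
        else:
            self.failures[strategy] += 1
```

In an optimizer loop, `select()` would be called before each prompt mutation and `update()` after the mutated prompt has been scored, so that strategies which help more often are sampled more often over time.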
Binary Classifier Optimization for Large Language Model Alignment: Proposes Binary Classifier Optimization (BCO), mathematically proving that the binary cross-entropy (BCE) loss is an upper bound of the DPO loss. This theoretical link enables LLM alignment using only "thumbs-up/down" binary feedback instead of pairwise preference data. By introducing a novel reward shift technique to tighten the upper bound, BCO performs comparably to DPO on paired preference datasets and outperforms both DPO and KTO on real-world Likert-5 annotated data.
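The upper-bound relationship can be checked numerically. The sketch below assumes scalar implicit rewards `reward_chosen` and `reward_rejected`, the standard DPO form $-\log\sigma(r_w - r_l)$, and a BCE form $-\log\sigma(r_w - \delta) - \log\sigma(-(r_l - \delta))$ with a reward shift $\delta$; these names and the exact parameterization are assumptions for illustration. The bound holds because $\sigma(a - b) \ge \sigma(a)\,\sigma(-b)$ for all real $a, b$, and the shift cancels inside the DPO term.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(reward_chosen, reward_rejected):
    """Pairwise loss: only the reward margin matters."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def bce_loss(reward_chosen, reward_rejected, shift=0.0):
    """Binary loss: chosen is labeled 1 and rejected is labeled 0, each scored on its own."""
    return -log_sigmoid(reward_chosen - shift) - log_sigmoid(-(reward_rejected - shift))
```

Because the two terms of `bce_loss` are independent, each can be computed from a single thumbs-up or thumbs-down example, which is what removes the need for paired preferences.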
Behavioral Analysis of Information Salience in Large Language Models: An interpretable analysis framework is proposed to systematically derive and investigate the concept of information salience internalized by LLMs, using length-controlled summarization behavioral probes and tracking the answerability of Questions Under Discussion (QUD). The study reveals that LLMs possess a hierarchical and consistent notion of salience, which, however, cannot be accessed via self-introspection and is only weakly correlated with human perception.
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models: Proposes BehaviorBox, which uses performance-aware contextual embeddings to automatically discover fine-grained features on which two language models differ in performance, such as specific contextual patterns like the subjunctive "were" in the conditional mood or exclamation marks after emotional sentences.
Better Embeddings with Coupled Adam: Theoretically proves that the token-wise second moment of the Adam optimizer is the root cause of word embedding anisotropy (mean shift) in LLMs, and proposes Coupled Adam—which averages the second moment of the embedding layer across the vocabulary—eliminating the anisotropy issue and improving embedding quality and downstream performance in large-scale experiments.
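A rough sketch of the idea follows, written with plain Python lists to stay dependency-free. It assumes the second moment is still tracked per token and per dimension, and that the coupling averages it over the vocabulary axis (keeping one value per embedding dimension) before the update is applied. That reading, the function name, and the default hyperparameters are assumptions, not the paper's reference implementation.

```python
import math


def coupled_adam_step(emb, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place Adam step on an embedding matrix with a vocabulary-averaged second moment."""
    vocab, dim = len(emb), len(emb[0])
    # Standard Adam moment updates, per token i and dimension j.
    for i in range(vocab):
        for j in range(dim):
            m[i][j] = beta1 * m[i][j] + (1 - beta1) * grad[i][j]
            v[i][j] = beta2 * v[i][j] + (1 - beta2) * grad[i][j] ** 2
    # Coupling: every token shares the mean second moment across the vocabulary.
    v_mean = [sum(v[i][j] for i in range(vocab)) / vocab for j in range(dim)]
    for i in range(vocab):
        for j in range(dim):
            m_hat = m[i][j] / (1 - beta1 ** step)   # bias-corrected first moment
            v_hat = v_mean[j] / (1 - beta2 ** step)  # bias-corrected shared second moment
            emb[i][j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return emb
```

With a shared denominator, rare and frequent tokens are scaled by the same factor, so differences in update size come only from the first moment.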
Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals' Subjective Text Perceptions: This paper systematically investigates whether LLMs can predict individual annotators' subjective text perceptions using sociodemographic attributes (age, gender, education, race/ethnicity). The authors find that performance improvements after fine-tuning primarily result from learning individual annotator behaviors rather than sociodemographic patterns, raising doubts about the feasibility of using LLMs to simulate sociodemographic variations.
Beyond Dialogue: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model: This paper proposes the Beyond Dialogue framework, which eliminates the bias between profiles and dialogues in role-playing training through profile-dialogue alignment and introduces a sentence-level fine-grained alignment task to help models better understand and perform character traits.
Beyond In-Context Learning: Aligning Long-form Generation of LLMs via Task-Inherent Attribute Guidelines: Proves both theoretically and experimentally that ICL exemplars fail to fully transfer linguistic and formatting attributes of a task. It proposes the LongGuide algorithm to automatically learn Metric Guidelines (MG) and Output Constraint Guidelines (OCG) from a small amount of training data, achieving an average improvement of over 5% ROUGE-L across 7 long-form generation tasks.
Beyond Output Matching: Bidirectional Alignment for Enhanced In-Context Learning: Proposes Bidirectional Alignment (BiAlign). Building upon traditional knowledge distillation that aligns only output distributions, it introduces input preference alignment. By utilizing ranking loss, the student model learns the teacher model's preference ranking over different ICL exemplars. BiAlign consistently outperforms baselines across five language understanding, reasoning, and coding tasks, yielding a 20% improvement on GSM8K and an 18% improvement on LogiQA.
Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs: This paper proposes CharacterBot, which learns the linguistic style and deep cognitive patterns of Lu Xun from his 17 essay collections through four training tasks (Author Perspective Reconstruction pre-training + multiple-choice/generative QA/style transfer fine-tuning) paired with the CharLoRA parameter update mechanism, significantly outperforming various baselines in linguistic accuracy and viewpoint understanding.
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms: This work proposes STA (Steering Target Atoms), which leverages Sparse Autoencoders (SAEs) to disentangle LLM representations into atomic knowledge components. By filtering and manipulating target atoms based on activation magnitude and frequency, STA achieves more robust and fine-grained behavior control compared to prompt engineering. It outperforms existing steering methods on both safety detoxification and reasoning control tasks.
BFS-Prover: Scalable Best-First Tree Search for LLM-Based Automatic Theorem Proving: This paper challenges the conventional wisdom that "automatic theorem proving requires complex search methods (such as MCTS or value functions)" by proposing the BFS-Prover system. Through three key innovations (expert iteration with data filtering, DPO based on compiler feedback, and length normalization), it scales simple best-first search (BFS) into a high-performance theorem prover, achieving a state-of-the-art (SOTA) score of 72.95% on the MiniF2F test set.
Bias in Language Models: Beyond Trick Tests and Towards RUTEd Evaluation: By comparing standard bias benchmarks ("trick tests") with the scenario-based RUTEd evaluation, this work reveals a lack of significant correlation between standard bias benchmarks and bias manifestations in realistic application scenarios, advocating for application-specific bias evaluation.
Biased LLMs Can Influence Political Decision-Making: Through two large-scale interactive experiments (N=299), this paper provides the first empirical evidence that LLMs with partisan biases can significantly influence human political opinions and budget allocation decisions. Crucially, this influence transcends partisan lines—Democrats can be persuaded by LLMs with conservative biases, and Republicans can likewise be influenced by LLMs with liberal biases.
BIG-Bench Extra Hard: To address the issue of BIG-Bench Hard being saturated by state-of-the-art models, Google DeepMind introduces BIG-Bench Extra Hard (BBEH), replacing the corresponding tasks in BBH with 23 significantly harder tasks. The strongest general model achieves only 9.8% (harmonic mean) while the strongest reasoning model reaches 44.8%, revealing a huge gap in LLMs' general reasoning capabilities.
Big5-Chat: Shaping LLM Personalities Through Training on Human-Grounded Data: This paper proposes the Big5-Chat dataset (100k dialogues), embedding real human Big Five personality traits into LLMs through SFT and DPO training methods. This approach significantly outperforms prompting-based methods. Additionally, the paper reveals that personality configurations of high conscientiousness/agreeableness and low extraversion/neuroticism can enhance the model's reasoning capabilities.
Bilingual Zero-Shot Stance Detection: To address the cross-lingual challenges in zero-shot stance detection, this paper proposes a bilingual joint framework. By constructing a shared semantic space and enabling cross-lingual knowledge transfer, the framework accurately determines text stance (favor/against/neutral) toward specific targets without labeled data in the target language.
BIPro: Zero-shot Chinese Poem Generation via Block Inverse Prompting Constrained Generation Framework: This paper proposes the BIPro framework, which leverages the infilling capability of Block Generative Models. Through two block inverse prompting methods, "revise" and "rewrite", it enables the weaker GLM-10B model to outperform GPT-4 and top-performing domain-specific systems in open-ended classical Chinese poetry generation without the need for domain-specific training.
Bitnet.cpp: Efficient Edge Inference for Ternary LLMs: This paper proposes the Bitnet.cpp inference system, which achieves efficient and lossless inference of ternary LLMs (such as BitNet b1.58) on edge devices through two innovative mixed-precision matrix multiplication (mpGEMM) kernels: the element-level lookup table (TL) and the Int2+Scale-based (I2_S) kernels, speeding up inference by up to 6.25x compared to full-precision baselines and up to 2.32x compared to low-bit baselines.
Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning: This paper proposes K-MSE (Knowledge-enhanced Molecular Structure Elucidation), a framework that constructs a molecular substructure knowledge base to expand the chemical structure space coverage of LLMs, designs a specialized molecule-spectrum scorer to replace self-evaluation by LLMs, and incorporates Monte Carlo Tree Search (MCTS) to achieve test-time inference scaling. On the MolPuzzle benchmark, it improves the accuracy of GPT-4o-mini and GPT-4o from 3.7% and 27.8% to 27.3% and 39.8%, respectively.
Brevity is the soul of sustainability: Characterizing LLM response lengths: This work systematically studies the response length behavior of 12 LLMs across 5 datasets. It finds that LLMs widely generate excessively verbose responses (with core answers accounting for only 42%). It proposes various prompting strategies that reduce response length by 25-88% and inference energy consumption by 25-60%, while maintaining or even improving the ROUGE-L F1 quality.
BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages: This paper constructs BRIGHTER, a multi-label emotion annotation dataset covering 28 languages, with a focus on low-resource languages in Africa, Asia, Eastern Europe, and Latin America. Annotated by native speakers, it establishes benchmark experimental results on both monolingual and cross-lingual emotion recognition tasks.
Can Large Language Models Understand Internet Buzzwords Through User-Generated Content: This paper constructs the first Chinese internet buzzword dataset, Cheer (1,127 instances), and proposes the Ress method, which guides LLMs to generate more accurate buzzword definitions from user-generated content by simulating the six-dimensional comprehension skills of child language acquisition, improving semantic accuracy by an average of 2.51%.
C²LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation: Proposes C²LEVA, a comprehensive Chinese-English bilingual evaluation benchmark containing 22 tasks. It systematically prevents data contamination through fully automated test data renewal and data protection mechanisms, demonstrating its effectiveness across 15 open-source and closed-source models.
Automated CAD Modeling Sequence Generation from Text Descriptions via Transformer-Based Large Language Models: This paper proposes a framework for automatically generating CAD modeling sequences from text descriptions, comprising a semi-automated annotation pipeline, a dual-channel Transformer generator TCADGen, and an LLM enhancement module CADLLM. It improves CAD command accuracy from 84% to 96.6% and reduces Chamfer Distance from 120.99 to 3.12.
Can Input Attributions Explain Inductive Reasoning in In-Context Learning?: This work designs a controlled benchmark of synthetic inductive reasoning tasks to evaluate how well 4 input attribution methods explain ICL. The results show that the simple gradient norm method often performs best, yet all methods perform inconsistently across tasks and model scales, indicating that ICL interpretability is more challenging than expected.
Can Language Models Reason about Individualistic Human Values and Preferences?: This work proposes the paradigm of "individualistic alignment" and constructs the IndieValueCatalog dataset (based on real-world data from 93k individuals in the World Values Survey, WVS). It evaluates and trains language models to reason about an individual's value judgments in novel scenarios based on their stated value expressions, revealing that frontier LMs achieve only 55%-65% accuracy and exhibit significant inequity across demographic groups.
Can Language Models Replace Programmers for Coding? RepoCod Says 'Not Yet': This paper constructs RepoCod, a benchmark containing 980 complex code generation tasks from 11 large-scale Python projects, featuring real repository-level dependencies and an average of 314 developer test cases. It reveals that even the most advanced LLMs achieve less than 30% Pass@1, falling far short of replacing programmers in real-world coding tasks.
Can Large Language Models Accurately Generate Answer Keys for Health-related Questions?: This paper explores using LLMs to automatically generate answer keys (information nuggets) for medical QA. By comparing various generation methods with human expert annotations, the study finds that the few-shot answer-based extraction method performs best. However, the capability of LLMs to extract atomic facts remains limited, with Llama 3.3 showing the best performance.
Can Large Language Models Address Open-Target Stance Detection?: This paper proposes the Open-Target Stance Detection (OTSD) task—where targets are unseen during training and not provided as input. It systematically evaluates the performance of 8 LLMs across four families (GPT, Gemini, LLaMA, Mistral) in both target generation and stance detection stages, finding that LLMs generally outperform existing TSE methods, but their performance drops significantly when targets are non-explicit.
Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions: This paper investigates the capacity of LLMs to handle direct knowledge questions and "loaded questions" embedded with false premises in the political domain. It evaluates whether LLMs can actively perform conversational grounding to correct users' false beliefs, revealing significant deficiencies in their ability to refuse false presuppositions and maintain factual accuracy.
Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs: This paper proposes a semi-automated literature analysis pipeline that utilizes LLMs to automatically extract experimental results from arXiv papers to construct a continuously updatable dataset, LLMEvalDB (comprising 18,127 records across 1,737 papers). Leveraging this dataset, the authors replicate and extend key findings regarding the effectiveness of CoT and ICL prompting strategies across different task types.
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers: Proposes the LimitGen benchmark to systematically evaluate the capability of LLMs in identifying limitations of scientific research papers. It includes a synthetic dataset (created via controlled perturbations) and a human-annotated dataset (from ICLR 2025 reviews), and enhances LLMs' ability to generate more specific and constructive feedback through RAG-augmented literature retrieval.
Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs: This study systematically evaluates the capability of LLMs to leverage Abstract Meaning Representation (AMR) in downstream tasks. It finds that AMR-augmented prompts significantly improve the zero-shot performance of Llama 3.1 on long-context tasks such as dialogue summarization (raising cosine similarity from 66% to 76%), whereas they typically degrade performance on short-context tasks.
Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference: This paper proposes the FormalBench benchmark to systematically evaluate the program semantics reasoning capabilities of LLMs through formal specification inference tasks. It finds that while LLMs perform well on simple control flows, they struggle with complex structures like loops. It also designs self-repair prompts that improve the success rate by 25%.
Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases: This paper evaluates the ability of LLMs to simulate non-native English speakers' (L2 learners) dialogue. Through information-theoretic and distribution density metrics, the authors analyze whether LLM-generated L2 English can replicate the native-language-dependent biases (such as tense consistency errors, avoidance behavior, etc.) of human L2 learners, finding that modern LLMs can indeed replicate certain L1-dependent patterns.

Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs

Can Third-parties Read Our Emotions?: Through human subject experiments, this study systematically compares the alignment between third-party annotators (human annotators and LLMs) and the first-party (author self-annotations) in emotion recognition tasks. It reveals a significant gap between third-party annotations and authors' actual emotions. Although LLMs outperform human annotators, their performance remains suboptimal. Demographic similarity is shown to improve annotation quality.
Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks: This paper proposes the CR-Planner framework, which guides the planning of reasoning and retrieval processes via a fine-tuned critic model. By utilizing Monte Carlo Tree Search (MCTS) to train the critic, the framework significantly outperforms baseline methods on competitive programming, theorem-proving mathematical reasoning, and complex domain retrieval tasks.
Can we Retrieve Everything All at Once? ARM: An Alignment-Oriented LLM-based Retrieval Method: Proposes ARM (Alignment-oriented Retrieval Method), which integrates three modules—information alignment (N-gram constrained decoding), structure alignment (MIP solver to reason about relationships between data objects), and self-verification aggregation—into the LLM decoding process to retrieve all required data objects "all at once." It significantly outperforms standard RAG (up to +5.2pt) and agentic RAG/ReAct (up to +19.3pt) on the Bird and OTT-QA datasets.
Can You Share Your Story? Modeling Clients' Metacognition and Openness for LLM Therapist Evaluation: This paper proposes the MindVoyager framework, which evaluates the exploratory capabilities of LLM therapists by constructing client simulators with dynamic metacognition and openness, addressing the issue of overly "cooperative" client simulators in existing evaluation methods.
Catching Shortcuts: A Framework for Evaluating Shortcuts in Large Language Models: This paper proposes a systematic framework to detect and evaluate the phenomenon of shortcut learning in large language models (LLMs). By constructing contrastive test sets and diagnostic metrics, it reveals that LLMs rely on spurious correlations rather than true semantic understanding across multiple NLP tasks.
CER: Confidence Enhanced Reasoning in LLMs: Proposes the Confidence Enhanced Reasoning (CER) framework, which quantifies the confidence of key tokens (numerical values in math tasks or proper nouns in open-domain tasks) in each intermediate step of CoT reasoning. It evaluates the reliability of the entire reasoning chain using the product of step-wise confidence and replaces simple majority voting with confidence-weighted aggregation, achieving improvements over self-consistency by up to 7.4% on math and 5.8% on open-domain tasks.
Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning: Proposes TIPA (Token Internal Position Awareness), a method that designs reverse character prediction training on the tokenizer vocabulary to enhance LLMs' perception of the internal character structure and positions within tokens, significantly improving performance on character-level tasks like Chinese Spelling Correction.
ChartLens: Fine-Grained Visual Attribution in Charts: Proposes the task of Post-Hoc Fine-grained Visual Attribution in charts, designs the ChartLens algorithm that leverages segmentation techniques to label chart elements and prompt multimodal LLMs with a Set-of-Marks for precise attribution, and constructs the ChartVA-Eval benchmark, achieving an F1 improvement of 26-66% across three chart types.
Cheaper and Better Diffusion Language Model via Task-Specific Training: This paper proposes to optimize diffusion language models through task-specific training strategies, significantly reducing training and inference costs while maintaining generation quality, making diffusion models more practical for text generation tasks.
ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events: This paper proposes the ChronoSense benchmark, which is the first to fully cover all 13 temporal relations of Allen's interval algebra and introduce three types of temporal arithmetic tasks. Through a systematic evaluation of seven LLMs under 0-shot, few-shot, and CoT settings, it reveals that models' temporal understanding capabilities are generally weak and heavily rely on pre-training memory.
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models: By identifying circuits for 10 compositional string editing operations on the PCFG SET dataset, this work investigates the modular relationships between functionally related circuits in Transformers. It finds that functionally similar circuits exhibit significant node overlap and cross-task fidelity, and that circuits can be combined via set operations (union) to represent more complex functions beyond the capability of individual circuits.
Circuit Stability Characterizes Language Model Generalization: This paper proposes "circuit stability" as a novel approach to evaluate the generalization capabilities of language models. By mathematically formalizing soft circuits and circuit equivalence, it demonstrates across three case studies (arithmetic reasoning, boolean expressions, and sports understanding) that circuit stability can predict and characterize generalization behavior.
Classifying Unreliable Narrators with Large Language Models: Drawing on literary narratology theory, this work defines three different levels of unreliable narrators (intra-narrational / inter-narrational / inter-textual), constructs an expert-annotated dataset TUNa, and systematically evaluates the performance of LLMs on the task of classifying unreliable narrators.
Clue Guided Re-Assessment to Improve Reasoning in Large Language Models: This paper proposes the "Clue Guided Re-Assessment" method, which extracts key clues during the LLM reasoning process and guides the model to reflect on and correct its initial reasoning, significantly improving the accuracy of multi-step reasoning tasks.
CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision: This paper proposes CodeTool, a step-by-step code generation framework that guides LLMs to select the optimal tool invocation path through two process reward mechanisms: On-the-spot Reward and Latent Reward. It significantly outperforms existing methods on StableToolBench and RestBench-TMDB.
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models: Drawing inspiration from circumstantial evidence standards in the legal domain, this paper proposes a hierarchical evaluation framework and the CogniBench dataset, systematically defining and evaluating the cognitive faithfulness of LLMs in cognitive statements (reasoning, evaluation, explanation) for the first time, and training the CogniDet detector to achieve simultaneous detection of factual and cognitive hallucinations.
CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models: Using eye-tracking data from cognitive science to analyze the behavior of LLM layers, this work finds that the middle layers correlate most strongly with human gaze and are the most suitable for semantic intervention. It proposes the CogSteer framework, in which fine-tuning only the single optimal layer (approximately 3% of parameters) matches or exceeds the performance of full-layer fine-tuning, as demonstrated on GLUE and detoxification tasks.
Enough Coin Flips Can Make LLMs Act Bayesian: Through the controlled stochastic process of coin flipping, this work systematically investigates whether LLMs perform Bayesian inference in in-context learning. The study reveals that LLMs typically possess biased priors, but as contextual evidence increases, they correct their posterior estimates in an approximate Bayesian update manner. The deviation primarily stems from poorly calibrated priors rather than a flawed updating mechanism.
Collaborative Performance Prediction for Large Language Models: This paper proposes a Collaborative Performance Prediction (CPP) framework that leverages the historical performance of multiple LLMs across multiple tasks alongside design factors of models/tasks to perform collaborative-filtering-style prediction, overcoming the limitation of traditional Scaling Laws restricted to single-family prediction and accurately forecasting downstream performance across different model families.
Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation: This paper proposes a hybrid NMT and LLM translation scheduling strategy (PPLT and JDM) based on source sentence features. It maintains optimal translation quality while reducing the LLM invocation rate to approximately 25-30%, significantly decreasing computational overhead.
ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development: This paper proposes ComfyUI-Copilot, an LLM-based hierarchical multi-agent framework. Deployed as a ComfyUI plugin, it provides intelligent node/model recommendations and one-click workflow generation. Powered by a knowledge base covering 7K nodes, 62K models, and 9K workflows, it has served over 19K users across 22 countries and processed more than 85K queries online.
Commonsense Reasoning in Arab Culture: This paper proposes the ArabCulture dataset (3,482 MSA questions covering 13 Arab countries, 4 regions, and 54 cultural subdomains) to systematically evaluate the Arabic cultural commonsense reasoning capabilities of multiple LLMs. The results show that even GPT-4o only achieves 90%, while most models score between 40% and 80%, revealing significant deficiencies of LLMs in understanding non-Western cultures.
Comparing Large Language Models in Extracting Subjective Information from Political News: This paper systematically compares the capabilities of various large language models in extracting subjective information (sentiment inclination, stance, bias, framing effects, etc.) from political news, finding that different LLMs exhibit significant performance variations across different dimensions of subjective information extraction, while revealing the impact of the LLMs' inherent political biases on the extraction results.
Comparing Linguistic Acceptability Judgments of Autoregressive Language Models: This paper compares the performance of various autoregressive language models (such as the GPT and Llama families) on linguistic acceptability judgment tasks. Through systematic experiments, it reveals the impact of model scale, training data, and architecture on grammatical judgment capabilities, and discusses whether the models' grammatical knowledge aligns with human linguistic intuition.
Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability: This paper proposes the Ordered CommonGen benchmark, which requires LLMs to generate sentences containing all given concepts in a specified order, thereby evaluating both compositional generalization and instruction-following capabilities. Evaluating 36 LLMs reveals that even the strongest model achieves an ordered coverage rate of only approximately 75%.
Computation Mechanism Behind LLM Position Generalization: Reveals that LLM attention logits learn an approximate arithmetic additive disentanglement of positional correlation and semantic importance ($W_{i,j} \approx f(\mathbf{q}, i-j) + g(\mathbf{q}, \mathbf{k})$ with a linear correlation of 0.959). It discovers the intermediate representation patterns that enable this disentanglement and uses them to explain LLMs' tolerance to positional permutation and their length generalization capabilities.
ConceptCarve: Dynamic Realization of Evidence: Proposes ConceptCarve, a framework that utilizes LLMs to dynamically construct concept trees to represent how evidence is concretely realized across different communities, significantly outperforming traditional retrieval systems in handling inferential gaps and domain sensitivity.
How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian: By constructing the first Italian subordinate-level psycholinguistic dataset (covering 187 basic categories), this work systematically compares the category organization structures of humans and LLMs at the subordinate concept level, finding low alignment overall but significant variation across different semantic domains.
Concreteness Versus Abstractness: A Selectivity Analysis in LLMs: This paper investigates the difference in how Large Language Models (LLMs) process concrete concepts (e.g., "apple") and abstract concepts (e.g., "freedom"). Through selectivity analysis, the authors discover that subpopulations of neurons in LLMs selectively respond to concreteness or abstractness, revealing an intriguing correspondence between the semantic representations of LLMs and the "concreteness effect" in human cognitive theories.
Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement: Condor proposes a two-stage synthetic data generation framework that constructs diverse tag-driven questions via a World Knowledge Tree and iteratively optimizes response quality using Self-Reflection Refinement. With only 20K synthetic samples, the base model outperforms rivals of similar sizes on dialogue alignment tasks, and the effectiveness of iterative self-refinement is validated on models up to 72B.
Conformity in Large Language Models: This paper adapts the classic Asch conformity experiment paradigm from psychology to LLMs to systematically study their conformity behaviors. It reveals that all evaluated models alter their answers under the influence of majority opinions, showing a higher susceptibility to conformity when uncertainty is greater. Furthermore, the paper proposes two intervention methods, Devil's Advocate and Question Distillation, which effectively mitigate this conformity effect.
ConSim: Measuring Concept-Based Explanations' Effectiveness with Automated Simulatability: ConSim proposes using LLMs as "simulators" to automatically evaluate the effectiveness of concept-based explanations. By testing whether an LLM can predict the explained model's output solely based on the provided explanations, ConSim simultaneously measures the quality of the concept space and the comprehensibility of the explanations, achieving a scalable, consistent, and comprehensive evaluation of explanation methods.
ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities: ConsistencyChecker proposes a reference-free LLM evaluation framework based on self-consistency trees. By constructing reversible tree-like multi-step paths (such as multilingual round-trip translation and equivalent code rewriting), it quantifies the model's ability to maintain semantics or functionality during iterative transformations. Dynamically generating benchmarks eliminates data leakage at its root, and the framework achieves a correlation of $r > 0.7$ with the authoritative WMT 2024 rankings, proving that LLM generalization capabilities can be reliably evaluated without paired data.
Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models: This paper proposes a framework based on Prototype Contrastive Perplexity (CP). By constructing positive and negative sample pairs that are semantically similar but possess different toxic attributes, and performing contrastive learning in the perplexity space to fine-tune LLMs, the framework achieves a significant reduction in toxicity (Mistral-7b toxicity drops from 33.1% to 4.3%) while having almost no impact on downstream task performance.
Contrastive Prompting Enhances Sentence Embeddings in LLMs through Inference-Time Steering: This paper proposes Contrastive Prompting (CP), an inference-time method that constructs an auxiliary prompt to encode non-core information of a sentence. By performing "semantic subtraction" between the hidden layer activations of the normal prompt and the auxiliary prompt during inference, it filters out irrelevant semantics like stop words, focusing the LLM sentence embeddings more on core semantics. This plug-and-play approach consistently improves the performance of various prompting methods (such as PromptEOL, CoT, and Knowledge) on STS and classification tasks.
Controlling Politeness in Multi-Turn Dialogues Through Pre-Phrase Augmentation: This paper proposes a method based on Pre-Phrase Augmentation, which automatically adds politeness-regulating prefixes during the dialogue generation process to achieve fine-grained politeness control in multi-turn dialogues while maintaining the coherence and informational integrity of the dialogue content.
Conversational Quality Assessment: A Large-Scale Corpus and Comprehensive Study: This paper constructs a large-scale, multi-dimensional conversational quality assessment corpus covering multiple quality dimensions such as fluency, consistency, informativeness, and engagingness. Based on this corpus, a comprehensive benchmark and analysis of existing conversational evaluation methods are conducted.
Convert Language Model into a Value-based Strategic Planner: This paper proposes the straQ* framework, which reformulates the next-token prediction of LLMs into next-strategy prediction. By training the LLM as a strategy-level Q-network using the Bellman equation, the framework plans the optimal supportive strategy based on long-term returns in Emotional Support Conversation (ESC). It serves as a plug-and-play, lightweight planner to guide dialogue LLMs to generate high-quality responses.
Cool-Fusion: Fuse Large Language Models without Training: This work proposes Cool-Fusion, a training-free method to fuse heterogeneous LLMs. By enabling multiple models to evaluate and rerank generated content at the text segment granularity, it achieves a 17.4% accuracy improvement over the strongest source model on GSM8K.
Cooperating and Competing Through Natural Language: This paper investigates the cooperative and competitive behaviors of LLM agents in natural language interaction environments. By designing multi-player game scenarios, it analyzes how linguistic strategies (such as persuasion, deception, and negotiation) affect game outcomes, revealing the emergent strategic capabilities and limitations of LLMs in social interactions.
COSMIC: Generalized Refusal Direction Identification in LLM Activations: This paper proposes the COSMIC framework, which leverages cosine similarity in the activation space to automatically select refusal guidance directions. It operates entirely without relying on model output tokens or predefined refusal templates. COSMIC matches the performance of existing methods under standard settings, and is the first to successfully extract effective refusal directions in scenarios of adversarial complete refusal and weakly aligned models.
Cross-Modal Alignment for LLM-Enhanced Spoken Language Understanding: This paper proposes a cross-modal alignment framework that achieves LLM-enhanced spoken language understanding (SLU) by explicitly aligning speech representations with the textual semantic space of LLMs, obtaining SOTA performance on intent detection and slot filling tasks.
Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts: Proposes the L-Cross Modulation method, which transfers concept steering vectors (SVs) from one LLM to another via simple linear transformations to achieve behavioral control. Three key findings are identified: (1) cross-model SV transfer is effective; (2) different concepts share the same transformation matrix; (3) SVs of smaller models can control larger models (weak-to-strong transfer).
Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest: This paper proposes the Next Tokens Extraction (NTE) paradigm, converting next-token prediction in LLM pre-training data into a BIO-tagged extraction task. By pre-training a RoBERTa tagger (Cuckoo) on 102.6 million instances derived from C4 and TuluV3, it comprehensively outperforms existing IE pre-training models in few-shot information extraction tasks.
Cultural Learning-Based Culture Adaptation of Language Models: This paper proposes the CLCA framework, which draws on cultural learning theory to generate culturally adapted dialogue data through simulated social interactions. By combining this with intention understanding for multi-task training, the framework significantly improves the cultural value alignment of various LLMs on the World Values Survey.
Culture is Not Trivia: Sociocultural Theory for Cultural NLP: Starting from sociocultural linguistic theory, this paper points out the methodological limitations of current cultural NLP (coarse-grained national boundaries, static benchmarks, and the lack of a unified definition of culture), argues that culture is a dynamically constructed process rather than static knowledge, and proposes "localization" as a more viable research framework.
DeAL: Decoding-time Alignment for Large Language Models: DeAL reformulates the LLM alignment problem as a heuristic search problem during decoding. It utilizes customizable reward functions (including programmatic constraints and parameterized reward models) to guide token selection during the inference phase, achieving flexible multi-objective alignment that can complement and stack with RLHF.
When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models: This work proposes a computational framework combining LLM word-level metaphor detection with SBERT document-level semantic association. Applied to 400,000 tweets about US immigration, it reveals a complex landscape in which conservatives use more dehumanizing metaphors, but biological metaphors exert a stronger effect on user engagement among liberals.
Déjà Vu? Decoding Repeated Reading from Eye Movements: This work introduces for the first time the task of automatically decoding whether a reader has previously read a text based on their eye movement patterns. Using feature-based XGBoost and neural RoBERTEye models, it achieves ~70% accuracy in single-trial experiments and ~91% in pairwise trials. It also incorporates synthetic saccadic pathways generated by the E-Z Reader cognitive model as auxiliary reference signals to enhance predictions.
DenseLoRA: Dense Low-Rank Adaptation of Large Language Models: This paper proposes DenseLoRA, which introduces a cross-layer shared Encoder-Decoder for the joint compression and reconstruction of hidden representations. It replaces two redundant low-rank matrices in LoRA with a single small, dense low-rank matrix for adaptation. With only 0.01% of trainable parameters, it achieves 83.8% accuracy on LLaMA3-8B, surpassing LoRA which achieves 80.8% with 0.70% parameters.
Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models: This paper reveals that LLMs exhibit "Deontological Keyword Bias" (DKB)—when prompts contain modal deontic expressions such as "must" and "ought to", the models misclassify over 90% of commonsense scenarios as obligations. The authors propose debiasing strategies based on few-shot examples and reasoning prompts.
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training: This study identifies the "refusal position bias" in standard safety fine-tuning data, where models only learn to refuse at the start of a response and fail to stop when they recognize unsafe content midway. The authors propose DeRTa (Decoupled Refusal Training), which includes MLE training with "harmful prefix + safe refusal" and RTO training that simulates a "harmful-to-safe" transition at every position. It enables LLMs to refuse at any position in the response where they detect unsafe content, outperforming GPT-4 and LLaMA3-Instruct across six attack scenarios.
Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models: This paper models the detection of referring expressions in visual dialogue as an autoregressive token prediction task. Through parameter-efficient fine-tuning (QLoRA) of Llama 3.1-8B, the authors demonstrate that textual context alone is highly effective for detecting mention spans in visually grounded dialogues, achieving F1 scores of 0.90 and 0.94 on the AGOS and PhotoBook datasets, respectively.
DICE-Bench: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues: This work proposes DICE-Bench, a function-calling benchmark targeting multi-round, multi-party dialogue scenarios. It comprises 1,607 high-quality dialogue instances and the DICE-Score metric, which quantifies information dispersion, revealing the limitations of current LLMs in tool invocation during complex dialogues.
DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models: DiffLM proposes a controllable data synthesis framework based on VAE + Latent Diffusion + Frozen LLM Decoder. By introducing a diffusion process in the latent space to precisely model the real data distribution and injecting distribution information into the LLM via soft prompts, the framework achieves a synthesis quality that outperforms real data by 2%-7% across three types of structured data: tables, code, and tools.
Direct Confidence Alignment: Aligning Verbalized Confidence with Internal Confidence In Large Language Models: The paper proposes Direct Confidence Alignment (DCA), which utilizes DPO to align the verbalized confidence of LLMs with their internal token probability confidence, thereby improving the consistency and transparency of the model's confidence expressions.
DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services: This work proposes DiSCo, a device-server collaborative LLM inference scheduler that optimizes user-perceived Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT) under cost constraints through cost-aware request dispatching and token-level migration mechanisms.
Disentangling Memory and Reasoning Ability in Large Language Models: This paper proposes explicitly decomposing the reasoning process of LLMs into "memory recall" and "logical reasoning" steps by introducing two learnable special tokens, <memory> and <reason>, that mark whether each step is knowledge recall or logical reasoning. After training data is generated with a dual-LLM framework, the target LLM is fine-tuned using LoRA. This improves performance and enhances interpretability on StrategyQA, CommonsenseQA, and TruthfulQA, with the 8B model surpassing GPT-4o on TruthfulQA.
Diversity-oriented Data Augmentation with Large Language Models: This paper proposes the DoAug framework, which fine-tunes an LLM paraphraser using SFT+DPO, combined with coreset selection and diversity sampling. While maintaining semantic coherence, it significantly enhances the diversity of the augmented dataset, leading to an average performance improvement of 10.52% across 12 datasets and outperforming the second-best baseline by 3.76 percentage points.
Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs: Drawing from psychological overconfidence theories, this paper reveals that LLM confidence estimation is insensitive to task difficulty and vulnerable to role-playing biases (e.g., overconfidence in expert personas, underconfidence in female/Asian personas while actual accuracy remains unchanged). It proposes Answer-Free Confidence Estimation (AFCE), which decouples confidence estimation from answer generation, reducing the ECE of GPT-4o by 58.4% on high-difficulty tasks.
Do Language Models Understand Honorific Systems in Javanese?: This work constructs the first Javanese honorific corpus, Unggah-Ungguh (4,024 sentences covering four honorific levels), and systematically evaluates the capability of LLMs to understand the Javanese honorific system across four tasks: classification, style transfer, cross-lingual translation, and dialogue generation. The results reveal that even the strongest closed-source model (GPT-4o) achieves a zero-shot classification accuracy of only 53.5% and shows a severe bias toward specific honorific levels.
Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm: Through a systematic analysis of cognitive task performance across multiple LLMs using the N-back paradigm, this study reveals that poor performance is primarily caused by insufficient task comprehension and task-set maintenance failure rather than working memory capacity constraints. Supported by curriculum learning, the best model (Llama 3.1 70b) can even perform the 10-back task (accuracy of 84.75%).
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?: This work constructs the shortcut-free evaluation dataset SOCRATES to systematically evaluate the actual capabilities of 41 LLMs in latent multi-hop reasoning. The authors find that models can achieve up to an 80% composition rate on country-type bridge entities, but only about 5% on year-type bridge entities.
Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?: Evaluating the "human-likeness" of 18 instruction-tuned LLMs in educational assessment from a psychometric perspective (Classical Test Theory, CTT and Item Response Theory, IRT), this study finds that even after temperature scaling calibration, LLM response distributions are inherently different from humans—large models are overconfident and fail to predict patterns of humans being attracted by distractors, suggesting that zero-shot LLMs are not suitable to replace humans in test piloting.
Does Time Have Its Place? Temporal Heads Where Language Models Recall Time-specific Information: Through EAP-IG circuit analysis, "Temporal Heads" specialized in processing time-conditional knowledge are discovered in Llama-2/Qwen/Phi-3. Ablating these heads selectively decreases the accuracy of temporal knowledge (by 3-9%) without affecting time-invariant knowledge or general QA. The study also demonstrates the feasibility of selective temporal knowledge editing by injecting temporal head activations.
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting: This work proposes Dolphin, a lightweight (322M) document image parsing model that adopts an "analyze-then-parse" two-stage paradigm. It first performs page-level layout analysis to generate an element sequence in reading order, and then utilizes heterogeneous anchor prompting to parse the content of each element in parallel. With only 322M parameters, it outperforms 7B+ models and commercial systems on both page-level and element-level parsing tasks.
Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation: Proposes DTA-Llama, which transforms the sequential tool invocation paths of traditional tree search into Directed Acyclic Graph (DAG) structures to achieve parallel invocation. A Process/Thread inference framework enables the LLM to decompose tasks and execute multiple tools in parallel in each turn, allowing Llama2-7B to match the performance of GPT-3.5 Parallel Function Calling on StableToolBench.
Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models: A dynamic web knowledge retrieval framework is proposed to enhance the quality of LLM-generated counter-arguments. A new, moderately sized evaluation dataset (150 pairs) is constructed, and an LLM-as-a-Judge evaluation methodology is used to replace traditional reference-based metrics. Experimental results demonstrate that integrating external knowledge significantly improves the relevance, persuasiveness, and factuality of the generated content.
Dynamic Parallel Tree Search for Efficient LLM Reasoning: This work proposes the DPTS (Dynamic Parallel Tree Search) framework. It introduces a Parallelism Streamline to resolve the parallelization difficulty caused by frequent path switching in tree search. Additionally, its Search and Transition Mechanism incorporates early stopping and deep search strategies to reduce redundant exploration of low-confidence paths. DPTS achieves a $2\times$ to $4\times$ inference acceleration on Qwen-2.5 and Llama-3 while maintaining or exceeding the reasoning accuracy of existing algorithms like MCTS.
ECLM: Entity Level Language Model for Spoken Language Understanding with Chain of Intent: This paper proposes the ECLM framework to apply LLMs to multi-intent spoken language understanding. By converting token-level slot filling into an entity recognition task, it solves the sequence alignment issue. It introduces the "Chain of Intent" to achieve step-by-step multi-intent recognition, significantly outperforming SOTA baselines on MixATIS and MixSNIPS.
EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models: Proposes EdiText, a controllable text editing method based on latent diffusion models. It combines SDEdit for coarse-grained editing and self-conditioning for fine-grained editing, achieving multi-scale text editing control ranging from minor modifications to extensive rewriting.
Educators' Perceptions of Large Language Models as Tutors: Comparing Human and AI Tutors in a Blind Text-only Setting: Through a blind evaluation experiment, this paper tasks human annotators with teaching experience to comparatively evaluate LLM tutors and human tutors in the context of grade-school math word problems. The results demonstrate that LLMs are rated superior to human tutors across four dimensions: engagement, empathy, pedagogical scaffolding, and conciseness, with the empathy dimension being the most prominent—where 80% of the annotators prefer LLMs.
Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets: Proposes EnsembleLoRA, an efficient ensemble method for multi-dataset fine-tuning. It utilizes a first-order Taylor approximation to rapidly estimate task affinity for dataset grouping, trains one adapter per group, and combines them via weighted aggregation. This achieves a 10% average test accuracy improvement over QLoRA on 10 SuperGLUE tasks with only 9% additional computational cost.
ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations: Constructs the ELI-Why benchmark containing 13.4K "Why" questions. Through two human studies, it is found that only 50% of GPT-4-generated explanations tailored to different educational levels match the targeted grade levels (compared to 79% for manually curated ones), and they satisfy learners' information needs 20% less than human answers.
Enabling LLM Knowledge Analysis via Extensive Materialization: This paper proposes a methodology to materialize the factual knowledge of LLMs into a knowledge base on a large scale through recursive querying and result consolidation. Leveraging this, the authors construct GPTKB, which contains 101 million triples and 2.9 million entities, and conduct the first comprehensive analysis of the scale, accuracy, bias, timeliness, and consistency of GPT-4o-mini's knowledge.
Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction: This paper explores the feasibility of extracting Theory of Mind (ToM) information from the internal representations of open-source LLMs (LLaMA). It leverages the BDI (Belief-Desire-Intention) framework to manipulate these representations for generating dialogue responses aligned with human social cognition, achieving win rates of 67% and 63% for ToM-aligned 3B and 8B models, respectively.

Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding

Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues: This paper proposes encoding gesture sequence data (3D human motion data) into discrete gesture tokens via a VQ-VAE, and then mapping them to the input space of a language model through feature alignment to enhance spoken discourse modeling. The complementary value of gesture information on spoken discourse understanding is validated through text-filling tasks across three types of discourse markers (discourse connectives, quantifiers, and stance markers).
Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery: The ECO-Concept framework is proposed to automatically extract textual concepts via a slot attention mechanism and evaluate concept comprehensibility using an LLM as a human proxy. A comprehensibility feedback loss guides model fine-tuning, achieving concept explanations with both high classification accuracy and human comprehensibility without any concept annotations.
Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge: This paper proposes the STL-DivEn dataset (16K samples) and the KGST (Knowledge-Guided STL Translation) framework. By translating natural language to Signal Temporal Logic (STL) via a two-stage "generate-then-refine" pipeline, it achieves an STL Formula Accuracy of 0.5587, significantly outperforming GPT-4 (0.4733) and DeepSeek (0.4790).
EnigmaToM: Improve LLMs' Theory-of-Mind Reasoning Capabilities with Neural Knowledge Base of Entity States: This paper proposes EnigmaToM, a neuro-symbolic framework that constructs a neural knowledge base of entity states (Enigma) to generate spatial scene graphs for belief tracking. Combined with a psychologically-inspired iterative masking mechanism for accurate perspective-taking, it significantly improves the Theory-of-Mind (ToM) reasoning capabilities of LLMs on three benchmarks (ToMi, HiToM, and FANToM), showing particularly outstanding performance in high-order reasoning scenarios.
Entity Framing and Role Portrayal in the News: This paper constructs a multilingual hierarchical entity framing corpus containing 5 languages, 1,378 news articles, and over 5,800 annotated entities. It proposes a narrative role classification system comprising 22 fine-grained roles (under three main frames: protagonist, antagonist, and innocent) and establishes baselines on fine-tuning multilingual Transformers and hierarchical zero-shot learning with LLMs.
Growing Through Experience: Scaling Episodic Grounding in Language Models: This paper proposes a weak-to-strong episodic grounding framework that collects structured experience data via MCTS, transfers the episodic grounding capabilities of smaller models to larger models through behavioral ratio distillation, and leverages DPO optimization to learn from both successful and failed experiences. This approach outperforms SOTA models, including GPT-4o, by 3.45% on physical planning tasks.
Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection: Proposes the ERM method, which enhances feedback quality by generating exemplars with detailed problem-solving processes guided by meta-prompts, and introduces Feedback Memory and Exemplar Factory as two long-term memory mechanisms to efficiently store and reuse historical feedback and exemplars, surpassing SOTA prompt optimization methods on multiple tasks with approximately half the optimization steps.
EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents: This paper introduces EscapeBench, a benchmark for evaluating the creative intelligence of LLM Agents based on escape room games (36 scenarios, 3 difficulty levels). It reveals severe deficiencies in current models regarding creative tool use and implicit goal inference, and proposes EscapeAgent (incorporating Foresight + Reflection) to reduce prompt dependency by nearly 50%.
Evaluating Implicit Bias in Large Language Models by Attacking from a Psychometric Perspective: Drawing on three psychometric principles from cognitive and social psychology (goal shifting, cognitive concordance, and imitation learning), this paper designs three types of attacks (Disguise, Deception, and Teaching) to elicit implicit biases in LLMs. A bilingual benchmark, BUMBLE (comprising 12.7K entries across 9 categories of bias), is constructed, revealing that all mainstream LLMs exhibit systematic implicit biases that can be triggered.
Evaluating Language Models as Synthetic Data Generators: This work proposes AgoraBench, a benchmark to systematically evaluate the data generation capabilities of 6 LLMs across 3 domains $\times$ 3 data generation methods. By training 99 student models, the study reveals that the data generation capabilities of LLMs are not directly correlated with their problem-solving abilities; GPT-4o performs best in instance generation, while Claude-3.5-Sonnet excels in quality enhancement.
ExpeTrans: LLMs Are Experiential Transfer Learners: ExpeTrans proposes an autonomous experience transfer framework. It mimics human cognitive intelligence to automatically transfer problem-solving experience from existing source tasks to newly encountered target tasks. This framework effectively improves LLM performance across 13 datasets without requiring manual experience collection for every new task.
Explain-then-Process: Using Grammar Prompting to Enhance Grammatical Acceptability Judgments: This paper proposes the "explain-then-process" paradigm for grammar prompting, where an LLM first generates an explanation of the target grammatical phenomenon, and this explanation is subsequently fed back to the target model (LLM or SLM) as context to assist in minimal-pair grammatical acceptability judgments. This paradigm significantly improves accuracy across three multilingual benchmarks (English BLiMP, Chinese SLING, and Russian RuBLiMP). Pairing SLMs with GP+CoT narrows the average LLM-SLM performance gap from 13.0 percentage points (pp) to 5.8pp (a 56% reduction).
ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models: This paper proposes the ExpliCa dataset (4,800 questions containing causal and temporal connectives), which integrates causal and temporal relation evaluation for the first time along with crowdsourced human ratings. The study finds that even top-tier models struggle to exceed 0.80 accuracy, and models systematically misclassify temporal relations as causal relations.
Explicit and Implicit Data Augmentation for Social Event Detection: This paper proposes SED-Aug, a dual data augmentation framework combining explicit (LLM text augmentation) and implicit (feature space perturbation) methods for social event detection. It outperforms the strongest baselines by 17.67% and 15.57% in Average F1 on Twitter2012 and Twitter2018, respectively.
Exploring Explanations Improves the Robustness of In-Context Learning: This paper proposes the X²-ICL framework, which systematically explores the latent reasoning space by generating explanatory reasoning paths for all possible labels (rather than only the observed label) within the in-context examples. This significantly improves the robustness of ICL on out-of-distribution (OOD) data—across 8 OOD datasets on 5 LLMs, X²-ICL outperforms ICL and X-ICL in 6–8 datasets.
Exploring Graph Representations of Logical Forms for Language Modeling: Proposes GFoLDS, a graph Transformer language model pre-trained on DMRS logical form graph representations, and introduces the "Linguistic Knowledge Catalysis Hypothesis" (LKCH): logical form language models acquire fundamental linguistic phenomena almost immediately, thereby accelerating the learning of complex patterns and substantially outperforming BERT given the same volume of data.
HiCUPID: Exploring the Potential of LLMs as Personalized Assistants: Introduces HiCUPID—the first open-source benchmark that comprehensively addresses five key desiderata of personalized AI assistants (adhering to user profiles, understanding implicit information, multi-info reasoning, long-context modeling, and proactive responses). It contains 1,500 users, each with ~40 dialogues, corresponding QA pairs, and a Llama-3.2 automatic evaluation model.
Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations: This paper introduces the TimeStress dataset (521K statements, 2,003 temporal facts) to evaluate the robustness of factual knowledge in 18 LLMs under temporal context variations. The results show that even the best model achieves perfect robustness on only 11% of the facts, committing critical errors that humans would never make.
FEAT: A Preference Feedback Dataset through a Cost-Effective Auto-Generation and Labeling Framework for English AI Tutoring: Proposes the FEAT framework, which automatically generates and labels teacher feedback preference datasets using LLMs for English tutoring systems. The study finds that mixing in only 5-10% of human-annotated data can outperform the ranking performance achieved by using 100% human-annotated data.
FoodTaxo: Generating Food Taxonomies with Large Language Models: This paper proposes FoodTaxo, an iterative, bottom-up taxonomy generation and completion algorithm based on Llama-3. It utilizes a three-stage pipeline consisting of CoT prompting + RAG retrieval + NLI verification to incrementally construct hierarchical taxonomies starting from known leaf-node concepts. It is competitive with state-of-the-art (SOTA) methods such as TacoPrompt on five benchmark datasets, and also uncovers the fundamental bottleneck of placing non-leaf nodes through reference-free metrics and ablation studies.
From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts: This work is the first to directly investigate the relationship between the frequency of facts in pre-training data and an LLM's ability to recall them. It proposes two sample efficiency metrics and reveals that while models of different architectures and scales perform similarly on high-frequency facts, they differ significantly on low-frequency facts—making the ability to learn low-frequency facts the key differentiator of model sample efficiency.
From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment: A cross-lingual alignment evaluation framework, NeuronXA, is proposed based on neuron activation states. By utilizing FFN layer neuron states as translation-invariant internal representations to measure the cross-lingual alignment capability of multilingual LLMs, it achieves a Pearson correlation of 0.9556 with downstream task performance using only 100 parallel sentence pairs.
From Selection to Generation: A Survey of LLM-based Active Learning: This paper presents the first systematic survey of active learning (AL) in the LLM era. It proposes a taxonomy structured around two orthogonal axes: Querying (selection + generation) $\times$ Annotation (human + LLM + hybrid). It comprehensively details how LLMs replace or enhance traditional methods in each step of the five-stage AL loop and extends the discussion to four major LLM learning paradigms: ICL, SFT, RLHF, and knowledge distillation.
Game Development as Human-LLM Interaction: This paper proposes Chat Game Engine (ChatGE), an LLM-based conversational game engine that enables users to develop customized games through natural language interaction without programming knowledge. It designs a data synthesis pipeline and a three-stage progressive training strategy to transform a conversational model into a game engine.
GAMEBoT: Transparent Assessment of LLM Reasoning in Games: This paper proposes GAMEBoT, a game-based LLM reasoning evaluation platform. By decomposing complex in-game reasoning into predefined modular subproblems combined with rule-based ground-truth verification, GAMEBoT achieves transparent reasoning capability assessment across 17 mainstream LLMs.
Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models: This work proposes GAPO, a gradient-adaptive multi-objective policy optimization method. By combining the Multiple Gradient Descent Algorithm (MGDA) with gradient normalization, GAPO balances the trade-offs among conflicting objectives such as helpfulness and harmlessness in LLMs. Furthermore, P-GAPO is introduced to support user-preference-driven Pareto frontier generation.
GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization: This work proposes the GAPO (Generative Adversarial Policy Optimization) framework, which integrates the adversarial training mechanism of GANs with PPO. By replacing the traditional decoder-only architecture with an encoder-only reward model, GAPO introduces a new paradigm called "Preferential Prompt" (modifying constraints in the prompts rather than the responses) to enhance the LLM's capability to understand and adhere to fine-grained constraints. It significantly outperforms baselines such as DPO, KTO, and SimPO on the IFEval and product description generation tasks.
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models: Proposes the Generative Psycho-Lexical Approach (GPLA) to automatically construct a five-factor value system for LLMs (Social Responsibility, Risk-Taking, Rule-Following, Self-Competence, and Rationality), outperforming the classical Schwartz human value system in structural validity, safety prediction, and value alignment.
Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models: This paper proposes the Genetic-Instruct algorithm, which draws inspiration from the crossover and mutation operations of evolutionary algorithms to scale 512 seed instructions into over 7.5 million high-quality coding instructions. Implementing a three-role pipeline consisting of Instructor-LLM, Coder-LLM, and Judge-LLM, the trained models outperform Self-Instruct and Evol-Instruct on code generation benchmarks.
GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction: Proposes GenKnowSub (General Knowledge Subtraction), which trains a general knowledge LoRA on the Wikipedia corpus and subtracts it from task-specific LoRAs ($LoRA_{res}^i = LoRA_{ts}^i - LoRA_g$) to obtain purer residual modules. Combined with the Arrow routing algorithm to dynamically select the most relevant modules, it improves zero-shot transfer average accuracy by 1.6% on Phi-3, with even larger gains in cross-lingual scenarios (German +3.9%, French +3.6%).
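The subtraction $LoRA_{res}^i = LoRA_{ts}^i - LoRA_g$ in the GenKnowSub entry above can be illustrated with a toy example. This is a minimal sketch, assuming the subtraction is taken on the effective weight updates $\Delta W = BA$ of each adapter; whether the paper subtracts at this level or on the factor matrices directly is not stated in the summary. All matrix values and helper names (`matmul`, `lora_delta`, `subtract`) are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B: Matrix, A: Matrix) -> Matrix:
    """Effective weight update of one LoRA module: delta_W = B @ A."""
    return matmul(B, A)


def subtract(x: Matrix, y: Matrix) -> Matrix:
    """Element-wise difference of two equally shaped matrices."""
    return [[p - q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


# Toy rank-1 adapters for a 2x2 weight matrix (made-up values).
B_ts, A_ts = [[1.0], [2.0]], [[3.0, 4.0]]  # task-specific LoRA
B_g, A_g = [[1.0], [0.0]], [[1.0, 1.0]]    # general-knowledge LoRA

# Residual update: task-specific update minus general-knowledge update.
delta_res = subtract(lora_delta(B_ts, A_ts), lora_delta(B_g, A_g))

# The same residual can be stored as a single adapter of rank r_ts + r_g by
# concatenating the factors: B_res = [B_ts, -B_g], A_res = [A_ts; A_g].
B_res = [row_ts + [-v for v in row_g] for row_ts, row_g in zip(B_ts, B_g)]
A_res = A_ts + A_g

print(delta_res)
print(matmul(B_res, A_res) == delta_res)
```

Under this assumption the residual module stays low-rank (at most the sum of the two ranks), so it can still be stored and routed like an ordinary adapter.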
Towards Geo-Culturally Grounded LLM Generations: This paper systematically evaluates the impact of two RAG strategies—Knowledge Base grounding (KB grounding) and search grounding—on the cultural awareness capabilities of LLMs. It finds that search grounding significantly improves propositional cultural knowledge but exacerbates stereotype risks, and neither strategy improves cultural fluency in human evaluations.
Geometric Signatures of Compositionality Across a Language Model's Lifetime: By linking the degree of dataset compositionality with the non-linear intrinsic dimension ($I_d$) and linear effective dimension ($d$) of language model representations, this work reveals a form-meaning dichotomy: non-linear $I_d$ encodes meaningful compositional semantic complexity, whereas linear $d$ encodes surface word-form complexity. This correspondence is established during training alongside the emergence of linguistic capabilities.
GIFT-SW: Gaussian Noise Injected Fine-Tuning of Salient Weights for LLMs: This paper proposes GIFT-SW, a novel parameter-efficient fine-tuning method. By updating only the "salient columns" of the weight matrices while injecting Gaussian noise into non-salient columns, GIFT-SW outperforms full-parameter fine-tuning and modern PEFT methods such as LoRA and DoRA under equivalent computational budgets.
Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization: This paper proposes POUGH, an efficient universal goal hijacking method against LLMs using an efficient incremental optimization algorithm alongside two semantics-guided prompt organization strategies (sampling strategy + ordering strategy). It achieves an average attack success rate of 93.41% across four open-source LLMs and ten malicious target responses.
What Makes a Good Natural Language Prompt?: By conducting a meta-analysis of over 150 prompting papers, this work proposes an attribute-centric prompt quality evaluation framework consisting of 6 dimensions and 21 attributes. Empirical experiments on reasoning tasks show that single-attribute enhancement often outperforms multi-attribute combinations, and fine-tuning on attribute-enhanced data can further boost model reasoning capabilities.
GORP: Continual Gradient Low-Rank Projection Fine-Tuning for LLMs: GORP proposes to unify the gradients of full-rank parameters and LoRA low-rank parameters by projecting them into a low-rank gradient subspace for joint updates. By utilizing the first moment of Adam to implicitly construct a shared gradient space across tasks, it alleviates catastrophic forgetting. In continual learning settings on T5 and LLaMA2, its performance is close to the multi-task joint training upper bound.
GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models: This paper provides the first systematic analysis of the Offsite-tuning problem from the perspective of optimization theory. It proposes the Gradient-preserving Compression Score (GCS) and designs the GradOT method. GradOT employs Dynamic Rank Decomposition (DRD) for MHA and Selective Channel Pruning (SCP) for MLP, simultaneously achieving performance preservation and privacy protection under training-free conditions.
Graph Counselor: Adaptive Graph Exploration via Multi-Agent Synergy to Enhance LLM Reasoning: Graph Counselor proposes a multi-agent collaborative GraphRAG reasoning framework. It adaptively extracts graph structural information through three agents (Planning/Thought/Execution) and introduces a multi-perspective self-reflection mechanism to correct reasoning biases, outperforming existing methods on multiple graph reasoning tasks.
Can Graph Descriptive Order Affect Solving Graph Problems with LLMs?: This paper presents the first systematic study on the impact of graph description orders (BFS, DFS, PageRank, PPR) when LLMs solve graph reasoning problems, demonstrating that ordered descriptions significantly outperform random ones, and different tasks prefer different permutation strategies.
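To make the idea of a graph description order in the entry above concrete, the sketch below serializes the same edge list in BFS order and in DFS order before it would be placed in a prompt. It is only an illustration under stated assumptions: the traversal rules (edges are emitted when first encountered, neighbors visited in ascending order), the sentence template, and all function names are hypothetical, and the PageRank and PPR orders mentioned in the summary are not covered.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def adjacency(edges: List[Edge]) -> Dict[int, List[int]]:
    """Build an undirected adjacency list with neighbors sorted ascending."""
    adj: Dict[int, List[int]] = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return {node: sorted(nbrs) for node, nbrs in adj.items()}


def bfs_edge_order(edges: List[Edge], start: int) -> List[Edge]:
    """List each edge once, in the order a breadth-first traversal meets it."""
    adj = adjacency(edges)
    seen_nodes: Set[int] = {start}
    seen_edges: Set[Edge] = set()
    ordered: List[Edge] = []
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            edge = (min(u, v), max(u, v))
            if edge not in seen_edges:
                seen_edges.add(edge)
                ordered.append(edge)
            if v not in seen_nodes:
                seen_nodes.add(v)
                queue.append(v)
    return ordered


def dfs_edge_order(edges: List[Edge], start: int) -> List[Edge]:
    """List each edge once, in the order a depth-first traversal meets it."""
    adj = adjacency(edges)
    seen_nodes: Set[int] = set()
    seen_edges: Set[Edge] = set()
    ordered: List[Edge] = []

    def visit(u: int) -> None:
        seen_nodes.add(u)
        for v in adj[u]:
            edge = (min(u, v), max(u, v))
            if edge not in seen_edges:
                seen_edges.add(edge)
                ordered.append(edge)
            if v not in seen_nodes:
                visit(v)

    visit(start)
    return ordered


def describe(ordered: List[Edge]) -> str:
    """Render an ordered edge list as plain text (hypothetical template)."""
    return " ".join(f"Node {u} is connected to node {v}." for u, v in ordered)


# A small graph whose edges arrive in arbitrary order.
edges = [(3, 4), (0, 2), (2, 3), (0, 1), (1, 3)]
print(describe(bfs_edge_order(edges, start=0)))
print(describe(dfs_edge_order(edges, start=0)))
```

Both outputs describe the same graph; only the order of the sentences differs, which is the variable the study manipulates.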
HFT: Half Fine-Tuning for Large Language Models: This paper proposes Half Fine-Tuning (HFT), which randomly freezes half of the parameters and updates only the other half during each fine-tuning epoch. Without altering the model architecture, HFT significantly mitigates catastrophic forgetting while achieving comparable or even superior performance on downstream tasks compared to Full Fine-Tuning (FFT), reducing training time by approximately 30%.
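The freeze-half-and-update-half loop described in the HFT entry above can be sketched as follows. This is a toy illustration, not the authors' implementation: parameters are reduced to one scalar per named block, gradients are made-up constants, and the granularity at which the paper selects the frozen half (individual weights, matrices, or layers) is an assumption here. The names `pick_frozen` and `sgd_step` are hypothetical.

```python
import random
from typing import Dict, List, Set


def pick_frozen(names: List[str], rng: random.Random) -> Set[str]:
    """Randomly choose half of the parameter blocks to freeze for one round."""
    return set(rng.sample(names, len(names) // 2))


def sgd_step(params: Dict[str, float], grads: Dict[str, float],
             frozen: Set[str], lr: float = 0.1) -> Dict[str, float]:
    """Apply a gradient step only to the blocks that are not frozen."""
    return {name: (value if name in frozen else value - lr * grads[name])
            for name, value in params.items()}


params = {f"layer{i}": 1.0 for i in range(4)}
grads = {name: 1.0 for name in params}  # made-up constant gradients
rng = random.Random(0)

for round_idx in range(3):
    frozen = pick_frozen(list(params), rng)  # a fresh random half each round
    before = dict(params)
    params = sgd_step(params, grads, frozen)
    updated = [name for name in params if params[name] != before[name]]
    print(round_idx, "frozen:", sorted(frozen), "updated:", updated)
```

Because a new half is drawn every round, every block is still trained over time, while in any single round half of the parameters keep their previous values.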
Hierarchical Attention Generates Better Proofs: This paper proposes a Hierarchical Attention regularization method. By establishing a five-layer semantic hierarchy to guide the attention mechanism of LLMs, the approach aligns it with the natural information flow of mathematical reasoning. It improves proof success rates on miniF2F and ProofNet by 2.05% and 1.69%, respectively, while reducing proof complexity by 23.81% and 16.50%.
Hierarchical Retrieval with Evidence Curation for Open-Domain Financial QA: HiREC proposes a hierarchical retrieval and evidence curation framework that first retrieves relevant documents and then selects passages from them. It filters out irrelevant passages and automatically generates complementary queries to retrieve missing information. On the LOFin benchmark containing 145k SEC documents, it improves answer accuracy by over 13% compared to the strongest RAG baseline.
How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs: By constructing an Expert-in-the-Loop probing pipeline with three sets of cognitive linguistics experiments (truth-value judgment, word completion, and open-ended causal questioning) across 16 narratives, 30 prompt variants, and 7 LLMs, this study systematically evaluates LLMs' understanding of grammatical aspect (perfective vs. imperfective) in narratives. The results show that LLMs achieve only 18% accuracy under non-prototypical aspect conditions (compared to 71% for humans) and lack long-distance causal reasoning capabilities.
How Numerical Precision Affects Arithmetical Reasoning Capabilities of LLMs: Based on circuit complexity theory, this study rigorously proves that low-precision (e.g., int4/int8) Transformers require super-polynomial size to solve iterative addition and integer multiplication, whereas standard-precision (float32) Transformers can efficiently solve three classes of arithmetic tasks with constant depth and polynomial width. The critical impact of precision on arithmetic capability is empirically verified on LLaMA-3.1-8B.
How to Enable Effective Cooperation Between Humans and NLP Models: A Survey of Principles, Formalizations, and Beyond: This paper presents the first systematic survey on the principles, formal taxonomy, and open challenges of Human-Model Cooperation. It proposes a taxonomy of three cooperation paradigms based on "who makes the final decision" (sequential, triage-based, and joint cooperation), and outlines the role frameworks and methodological roadmaps for each paradigm.
HumT DumT: Measuring and Controlling Human-like Language in LLMs: This paper proposes HumT, a metric for human-like tone based on GPT-2 log-probability ratios, and its social perception generalization SocioT. Analysis of over 400k preference samples reveals that users generally prefer LLM outputs with lower human-likeness. Furthermore, human-like tone correlates positively with social closeness ($r=0.87$) and femininity ($r=0.47$) and negatively with status ($r=-0.80$). Finally, DPO fine-tuning (DumT) using only 500 preference pairs effectively reduces human-likeness without sacrificing model performance.
HyGenar: An LLM-Driven Hybrid Genetic Algorithm for Few-Shot Grammar Generation: Constructs a grammar generation dataset comprising 540 challenges, designs 6 evaluation metrics, and proposes HyGenar, an LLM-driven hybrid genetic algorithm, significantly improving LLMs' ability to generate BNF grammars from few-shot examples.
Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge: This paper proposes the EmoBi framework, which utilizes a three-stage prompting process consisting of emotion analysis, emotion-guided domain mapping, and bidirectional dynamic interaction. By leveraging LLMs to uncover emotional cues behind hyperbole and metaphor as well as their mutually reinforcing relationship, the method substantially outperforms state-of-the-art (SoTA) approaches across four datasets (achieving a 28.1% F1 gain for hyperbole detection on TroFi and a 23.1% F1 gain for metaphor detection on HYPO-L).
Enhancing the Rule Learning Ability of Large Language Model Agent through Induction, Deduction, and Abduction: This paper proposes the RULEARN benchmark (comprising 300 handcrafted interactive text environment puzzles across three scenarios) and the IDEA framework (an iterative cycle of abductive hypothesis generation $\rightarrow$ deductive plan validation $\rightarrow$ inductive feedback refinement). The framework achieves a 50.33% success rate on GPT-4o (+7% vs. the ReAct baseline), which remains significantly below the human performance of 63.33%. Fine-grained human evaluation reveals the fundamental bottleneck of LLMs during the hypothesis refinement stage.
If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World: A qualitative and quantitative analysis of 79,951 ChatGPT conversations (WildChat dataset) investigates how lonely users use LLM services. Lonely users engage in much longer conversations (12 vs. 5 turns), and 37% seek advice or a listening ear. However, ChatGPT responds inappropriately in severe scenarios such as suicidal ideation, and toxic content in lonely conversations reaches up to 55% (compared to 20% in the main corpus), with women being targeted 22 times more often than men.
The Impossibility of Fair LLMs: This work systematically analyzes four mainstream technical fairness frameworks (FTU, multi-sided fairness, group fairness/fair representations, composability of fairness) and demonstrates that all of them face inherent and insurmountable challenges in general-purpose LLM scenarios. It argues that strictly fair LLMs are theoretically impossible and proposes three pragmatic future directions.
Improve Language Model and Brain Alignment via Associative Memory: By performing data augmentation on text to simulate associative memory and applying associative memory instruction tuning on LLMs, this study demonstrates that both approaches significantly improve the alignment between language models and the human brain in speech comprehension tasks, particularly in associative memory-related brain regions such as the medial temporal lobe.
Improving Contextual Faithfulness of Large Language Models via Retrieval Heads-Induced Optimization: This paper finds that "retrieval heads" in LLMs are highly correlated with contextual faithfulness in long-form question answering. Based on this finding, it proposes the Rhio framework, which generates unfaithful samples by masking retrieval heads, introduces control tokens for faithfulness-aware tuning, and uses contrastive decoding to enhance faithfulness. With Rhio, both 7B and 13B models surpass GPT-4o.
Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes: This paper proposes using linear classifying probes combined with contrast pairs to extract the latent preference judgments of LLMs. This approach consistently outperforms traditional generative evaluation methods on LLM-as-Judge tasks, and supervised probes even exceed fine-tuned evaluators while maintaining comparable computational costs.
InductionBench: LLMs Fail in the Simplest Complexity Class: This paper proposes InductionBench, an inductive reasoning benchmark based on the subregular hierarchy, which reveals that even the strongest LLMs (such as o3-mini) struggle to master inductive reasoning tasks in the simplest complexity classes, exposing fundamental limitations of current LLMs in inducing rules from observational data.
InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model: This paper proposes InfiniSST, which models simultaneous translation of unbounded streaming speech as an LLM multi-turn dialogue task. It combines robust segment training data construction, multi-delay augmentation strategies, and Λ-shaped KV cache management to reduce computation-aware latency by 0.5-1 second on the MuST-C En-Es/De/Zh directions without losing translation quality.
Information Locality as an Inductive Bias for Neural Language Models: This paper proposes $m$-local entropy, an information-theoretic metric to quantify local uncertainty in language. Through experiments on perturbed natural language and languages defined by Probabilistic Finite-State Automata (PFSA), it is demonstrated that languages with higher $m$-local entropy are harder for Transformer and LSTM language models to learn, revealing that neural language models, like humans, are highly sensitive to the local statistical structure of language.
INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages: This paper introduces Injongo, a multicultural intent detection and slot-filling benchmark dataset covering 16 African languages. It is organically generated by native speakers across domains such as banking, travel, home, and food/beverage. Experiments reveal that LLMs perform extremely poorly on slot filling in African languages (GPT-4o achieves only 26 F1), and intent detection also lags significantly behind fine-tuned baselines.
Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs: This work proposes ID-SPAM, which generates input-dependent soft prompts by applying a learnable self-attention layer over input token embeddings followed by a bottleneck MLP. When prepended to the input of only a single Transformer layer, it outperforms various soft prompt baselines and demonstrates excellent zero-shot cross-task and cross-domain transferability.
INTERACT: Enabling Interactive, Question-Driven Learning in Large Language Models: The INTERACT framework is proposed to simulate teacher-student dialogue, enabling a "student" LLM to learn new concepts from a "teacher" LLM by actively asking questions. Experiments on 1,347 unseen contexts demonstrate that interactive learning can improve comprehension accuracy by up to 25%, matching the static learning baseline in just 5 conversational turns.
Interactive and Expressive Code-Augmented Planning with Large Language Models: This paper proposes REPL-Plan, a top-down planning approach that enables LLMs to interact with an extended REPL (Read-Eval-Print Loop). This method leverages the full expressiveness of code while enabling dynamic error correction and handling of vague subproblems, achieving robust performance on ALFWorld, WebShop, and real-world web navigation tasks.
Internal and External Impacts of Natural Language Processing Papers: This work systematically analyzes the impact of ACL/EMNLP/NAACL papers from 1979 to 2024 across both internal (academic citations) and external (patents, media, policy documents) dimensions. It finds that the language modeling topic has the broadest impact, that the ethics/fairness topic has prominent impact in policy documents despite low academic citations, and that multi-dimensional external impact can effectively predict which papers are highly cited internally.
Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style: The authors quantify "memory strength" by measuring the consistency of LLM responses to different paraphrases of the same question, finding that the model's acceptance of external evidence is highly negatively correlated with memory strength, and paraphrased evidence is more effective than repetitive or detailed evidence.
IPO: Your Language Model is Secretly a Preference Classifier: Proposes Implicit Preference Optimization (IPO), which leverages the generative LLM itself as a preference classifier (via the probability of "Yes/No" tokens) instead of an external reward model to obtain preference signals, achieving low-cost self-alignment training.
Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs: Using the English polysemous discourse particle "just" as a case study, this paper systematically evaluates the capabilities of LLMs to understand the fine-grained semantics of discourse particles through two metalinguistic experiments (few-shot semantic labeling and pairwise comparison) using both expert-constructed datasets and movie subtitle annotation data. The findings show that while models can distinguish broad categories (adjectival and temporal meanings), they fail to fully capture the subtle semantic differences of discourse particles (exclusive, unelaboratory, unexplanatory, and emphatic meanings).
JoPA: Explaining Large Language Model's Generation via Joint Prompt Attribution: This work proposes the JoPA (Joint Prompt Attribution) framework, which models prompt attribution for LLM generation tasks as a combinatorial optimization problem. It utilizes a probabilistic search algorithm to efficiently find combinations of input tokens that causally impact the output, thereby addressing the limitation of existing methods that ignore cooperative effects among tokens.
Just a Scratch: Enhancing LLM Capabilities for Self-Harm Detection through Intent Refinement: This paper proposes the SHINES dataset and the CESM-100 emoji matrix to distinguish between "casual mention" and "serious intent" in self-harm expressions on social media. Combining contextual emoji interpretation with multi-task fine-tuning improves the LLM F1 score for self-harm detection from 0.74 (zero-shot) to 0.88 (multi-task + CESM-100), while generating interpretable reasons for predictions.
KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan: This paper proposes KazMMLU, the first MMLU-style bilingual (Kazakh + Russian) evaluation benchmark designed specifically for Kazakhstan. It contains 23,000 multiple-choice questions from authentic educational materials, covering various disciplines (such as STEM, humanities, and social sciences) across multiple educational levels. Using this benchmark to evaluate 27 multilingual LLMs, the study reveals significant deficiencies in current models' Kazakh capabilities.
Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons: Proposes Knockout Assessment, an iterative pairwise comparison LLM-as-a-Judge method based on a knockout tournament system. By allowing responses to be compared repeatedly across multiple tournament rounds to establish a global ranking perspective, it achieves an average improvement of 0.07 in Pearson correlation coefficient over individual assessment methods on science exam scoring and machine translation evaluation.
Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations: By probing the internal representations of LLMs, this study reveals that knowledge boundary cognition is linearly structured across multiple languages. A training-free alignment method is proposed to achieve cross-lingual transfer of knowledge boundary perception, and a "weak-to-strong generalization" phenomenon is discovered.
Knowledge Boundary of Large Language Models: A Survey: This survey proposes a formal definition framework for the knowledge boundary of LLMs, featuring three-tier nested boundaries (Outward⊂Parametric⊂Universal) and four categories of knowledge (PAK/PSK/MSU/MAU). It systematically reviews relevant research around three questions: "why, how to identify, and how to mitigate."
Language-Codec: Bridging Discrete Codec Representations and Speech Language Models: This paper proposes Language-Codec, which bridges the gap between discrete codec representations and downstream speech language models via a Masked Channel Residual Vector Quantization (MCRVQ) mechanism and an improved Fourier transform decoder, achieving high-quality audio reconstruction using only 4 codebook channels.
Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions: Proposes directly fine-tuning LLMs on large-scale public opinion survey data (SubPOP, containing 3,362 questions and 70K subpopulation-response pairs) to predict the opinion distributions of different demographic subpopulations, reducing Wasserstein distance by 32-46% compared to prompt engineering baselines while generalizing to unseen surveys and subpopulations.
Language Models Grow Less Humanlike beyond Phase Transition: This paper finds that the alignment of language models with human reading behavior (psychometric predictive power, PPP) first increases and then decreases during pre-training, with a distinct inflection point. Through correlation and causal experiments, it demonstrates that this inflection point is caused by a phase transition (the rapid emergence of specialized attention heads) during pre-training. Furthermore, this phase transition does not directly produce harmful attention patterns; rather, it alters the model's subsequent learning dynamics, causing a continuous deviation from human patterns.
Large Language Models are Good Relational Learners: The authors propose the Rel-LLM framework, which utilizes a GNN encoder to extract structured subgraph representations from relational databases and injects them as soft prompts into a frozen LLM. It achieves SOTA performance on relational deep learning (RDL) tasks on the RelBench benchmark and supports zero-shot prediction.
Large Language Models for Predictive Analysis: How Far Are They?: Proposes the PredictiQ benchmark, the first comprehensive framework to systematically evaluate the predictive analysis capabilities of LLMs. Integrating 44 real-world datasets across 8 domains and 1,130 expert-designed queries, the benchmark evaluates 12 mainstream LLMs across three dimensions (text analysis, code generation, and text-code alignment) and seven aspects. It reveals that even the strongest model, GPT-4o-mini (GPT4O3Mini), still exhibits significant deficiencies in depth of analysis (2.91/4) and data preprocessing (absent in 51% of cases).
Large Language Models in Bioinformatics: A Survey: This paper systematically reviews the progress of large language models in four major areas of bioinformatics (DNA/genomics, RNA, protein, and single-cell analysis), covering the architectures, tasks, and datasets of over 30 representative models, and discusses core challenges and future directions such as data scarcity, computational complexity, and cross-omic integration.
LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews: This work introduces LazyReview, the first fine-grained classification dataset for "lazy thinking" in NLP peer reviews, containing 500 expert-annotated and 1,276 silver-annotated instances. Through a three-round iterative annotation protocol and positive example enhancement, annotation consistency is doubled. The study demonstrates that instruction-tuning LLMs on this dataset improves detection performance by 10–20 percentage points, and controlled experiments show that providing lazy thinking feedback significantly improves review quality.
Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery: This paper proposes the DISCOG (DISCOvery Graph) system, which integrates knowledge graphs with LLM-driven reasoning for document retrieval and classification in electronic discovery (eDiscovery). It outperforms strong baselines on both balanced and imbalanced datasets, reducing litigation document review costs by approximately 98% in real-world deployment.
Length Controlled Generation for Black-box LLMs: An iterative sampling framework based on the Metropolis-Hastings algorithm, integrated with an importance sampling acceleration strategy, is proposed to achieve precise length control for black-box LLMs without modifying model parameters. It achieves a 100% length control success rate on Llama3.1 in at most 5 iterations, without compromising generation quality.
LESA: Learnable LLM Layer Scaling-Up: This paper proposes LESA, a learnable depth scaling-up method that discovers latent inter-layer patterns via SVD and predicts intermediate layer parameters using a neural network. Compared to heuristic layer replication methods, LESA achieves better initialization and faster convergence, reducing training costs by more than half.
Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility: This paper utilizes the known asymmetry in human "pronoun production" versus "pronoun interpretation" with implicit causality verbs as a testbed to systematically evaluate whether instruction-tuned LLMs can replicate this human cognitive asymmetry. It finds that model size and the choice of meta-linguistic prompts are the deciding factors.
Leveraging In-Context Learning for Political Bias Testing of LLMs: This paper proposes "Questionnaire Modeling" (QM), a novel probing task that utilizes human survey data as in-context examples to improve the stability of LLM political bias detection. The study finds that instruction tuning can alter the direction of bias, and larger models can more effectively leverage in-context examples and exhibit smaller bias scores.
Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora: This paper proposes leveraging the contextual understanding capabilities of LLMs to detect and quantify gender representation bias in the training corpora of grammatically gendered languages (e.g., Spanish and Valencian). A severe male-dominated imbalance is identified, and continuous pre-training using reverse-biased data is verified to effectively mitigate bias in model outputs.
ReCall: Library-Like Behavior In Language Models is Enhanced by Self-Referencing Causal Cycles: This paper introduces the concept of "Self-Referencing Causal Cycles" (ReCall), revealing how naturally occurring repetitive token sequences in LLM pre-training data form circular references. This enables autoregressive models to bypass unidirectional causal constraints and overcome the reversal curse. Based on this, a two-step ReCall-aware prompting strategy is designed.
On the Limit of Language Models as Planning Formalizers: This study systematically evaluates the limits of the "LLM-as-Formalizer" methodology. For the first time, LLMs are required to generate complete PDDL representations (rather than partial ones) to formalize planning domains from textual descriptions of varying levels of naturalness. The strongest models (GPT-4o/o3-mini/DeepSeek-R1) can effectively formalize, outperforming direct planning, but performance decreases as descriptions become more natural. Weak models struggle with syntax errors, whereas strong models face semantic errors.
Literature Meets Data: A Synergistic Approach to Hypothesis Generation: This work proposes the first method that synergistically integrates literature-driven and data-driven hypothesis generation. Through Refinement and Union strategies, LLMs are enabled to jointly generate more generalizable hypotheses from paper abstracts and observational data. The proposed approach achieves an average improvement of 3.37% over purely data-driven methods across OOD datasets for five social science classification tasks. Furthermore, human experiments demonstrate for the first time that LLM-generated hypotheses can significantly improve human decision-making accuracy (+7.44% / +14.19%).
LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs: Proposes LlamaDuo, an automated LLMOps pipeline that iteratively fine-tunes small models using synthetic data generated by service LLMs. This enables 2B-8B local models to approximate or match the performance of large models like GPT-4o on specific downstream tasks, significantly reducing long-term deployment costs.
LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding: This work systematically identifies and quantifies three mismatches (input attention / output attention / position IDs) when adapting batch-trained LLMs to streaming scenarios, discovering that only the input attention mismatch is the key bottleneck ($+2.20$ BLEU). Based on this insight, the authors propose Group Position Encoding, where the source and target groups scale consecutive position IDs independently without requiring expensive KV cache re-computation. This approach surpasses specialized streaming architectures in both machine translation and ASR cross-modal tasks.
LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates: LLMBraces dynamically adjusts the contribution weights of sub-updates by computing the relevance scores between each value vector in the FFN layers and the input. Using extremely few parameters (75% fewer than LoRA), it simultaneously improves model prediction accuracy and enables controllable text generation.
LLM as a Broken Telephone: Iterative Generation Distorts Information: Using translation as a testbed to simulate the "Telephone Game" in LLMs, this study finds that information becomes severely distorted after 100 iterations of translation. For example, a news report about a truck driver being fined is transformed into "a car exploded after receiving compensation" after 100 rounds of English-Thai translation. The choice of pivot language, chain complexity, and decoding temperature are identified as key factors regulating the rate of distortion.
LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models: This paper proposes LLM×MapReduce, a training-free divide-and-conquer framework that addresses the inter-chunk dependency and inter-chunk conflict that arise after chunking long texts, using a structured information protocol and an in-context confidence calibration mechanism. This enables LLMs with an 8K context to effectively process long texts exceeding 100K or even 1280K tokens, outperforming long-context models such as GPT-4.
LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs?: Proposes TSG Bench to systematically evaluate the capability of 11 LLMs on scene graph understanding and generation tasks, revealing significant bottlenecks of LLMs in scene graph generation (especially in multi-action decomposition).
Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options: Proposes the concept of "Reflective Judgment" to measure the ability of LLMs to decline to choose an answer when all multiple-choice options are incorrect. It reveals that aligned models (such as GPT-4o) tend to blindly follow instructions and select an incorrect option, whereas base models often perform better, and that this ability emerges as model scale increases.
LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs: This paper proposes TrickCatcher, which utilizes LLMs to generate program variants and test input generators, combined with a diversity-driven differential testing mechanism to detect "plausible programs" that pass existing test suites but still contain hidden bugs (tricky bugs). TrickCatcher achieves SOTA gains of 1.80× / 2.65× / 1.66× in Recall, Precision, and F1 score, respectively.
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks: Establishes Judge-Bench with 20 datasets (70k+ instances) to systematically evaluate 11 LLM judges against human annotations. The findings reveal large performance variance across tasks, attributes, and user expertise, indicating that task-specific human validation remains critical before deploying LLM judges.
LLMs can Perform Multi-Dimensional Analytic Writing Assessments: Using an L2 postgraduate literature review corpus, this study systematically evaluates the capabilities of LLMs in multi-dimensional analytic writing assessment (scoring + commenting) and proposes ProEval, an explainable feedback quality evaluation framework.
LLMs Can Be Easily Confused by Instructional Distractions: This paper reveals that LLMs are severely misled when processing scenarios where the input itself resembles an instruction (instructional distraction). It proposes the DIM-Bench benchmark to evaluate this issue, demonstrating that mainstream LLMs, including GPT-4o, are significantly affected, and existing prompting strategies cannot fundamentally resolve it.
LLMs Know Their Vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts: Proposes ActorBreaker, a multi-turn attack method based on Latour's Actor-Network Theory. By leveraging benign prompts semantically related to harmful content (natural distribution shifts) to bypass safety mechanisms, it achieves state-of-the-art (SOTA) attack success rates on HarmBench, revealing the semantic coverage gap between pre-training and safety training data.
LLMs + Persona-Plug = Personalized LLMs: This paper proposes PPlug, which compresses user historical behavior into a single personalized embedding via a lightweight plug-and-play user embedder to guide LLMs in generating personalized outputs. PPlug outperforms retrieval-based and fine-tuning-based baselines on the LaMP benchmark by up to 35.8%.
Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More: This paper demonstrates that the failure of decoder-only LMs on the path-star graph search task is not a fundamental limitation of the next-token prediction paradigm, but is caused by "supervision adulteration"—where excessive teacher-forcing supervision signals induce the model to learn a Clever Hans Cheat shortcut, preventing subtask decomposition. The task is shown to be learnable through six orthogonal methods, including token masking, ranking-into-the-future, scratchpad, and tree topologies.
Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models: This paper proposes the Locate-and-Focus method for terminology translation in speech LLMs. It first uses sliding window retrieval to locate audio segments containing terms, then guides the model to focus on translation knowledge through audio replacement and Tag Cues, significantly improving the terminology translation success rate for the English-Chinese and English-German directions.
Logical Forms Complement Probability in Understanding Language Model (and Human) Performance: This paper systematically investigates LLM capabilities in propositional and modal logic reasoning, finding that in addition to input probability (perplexity), logical form (modality, argument form) is an important complementary factor in predicting LLM performance. It further compares these findings with human behavioral data to reveal similarities and differences in human-machine reasoning.
LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information: This paper proposes LongDPO, which collects step-level preference pairs via MCTS, maintains factual consistency using a global memory pool, and enhances low-quality candidates through critiques. It then performs fine-grained optimization using stepwise DPO, significantly improving long-form text generation quality on LongBench-Write while preserving general capabilities.
Lost in Literalism: How Supervised Training Shapes Translationese in LLMs: This paper systematically investigates the phenomenon of translationese in machine translation produced by Large Language Models (LLMs), revealing that translationese bias in supervised fine-tuning (SFT) data is the root cause of unnatural translations in LLMs. To mitigate this issue, the paper proposes approaches that polish training reference translations and filter out unnatural training instances.
LR²Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems: The LR²Bench benchmark is proposed to systematically evaluate the long-chain reflective reasoning capabilities of LLMs across six types of Constraint Satisfaction Problems (CSPs). The evaluation reveals that even state-of-the-art reasoning models like DeepSeek-R1 and o1-preview only achieve an average Exact Match of 20.0% and 23.6%, respectively, highlighting substantial room for improvement in reflective reasoning.
Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats: Proposes the FETCH! benchmark and the EarShot system to discover emergent "dog whistles" (coded expressions with dual meanings) in large-scale social media corpora, leveraging a combination of vector databases and LLMs to achieve a 2-20 percentage point improvement in F-score over existing methods.
Mapping 1,000+ Language Models via the Log-Likelihood Vector: This paper proposes mapping 1,000+ language models into a unified space using the log-likelihood vector, proving that the Euclidean distance between vectors approximates the KL divergence. This approach enables model clustering visualization, benchmark performance prediction ($r=0.96$), and data leakage detection.
MAPS: Motivation-Aware Personalized Search via LLM-Driven Consultation Alignment: This work is the first to model "search motivation"—the genuine user needs latent in pre-search consultation behaviors—and proposes the MAPS framework that integrates LLM semantics, MoAE pooling, and dual-alignment mechanisms, improving HR@10 by 24.4% (from 0.5685 to 0.7071) on real-world commercial data.
Masking in Multi-hop QA: How LMs Perform with Context Permutation: Through systematic document permutation experiments and attention weight analysis, this study reveals that causal masking is a structural bottleneck for decoder-only LLMs in multi-hop QA, and demonstrates that replacing causal masking with a prefix mask significantly improves both performance and robustness.
MasRouter: Learning to Route LLMs for Multi-Agent Systems: This work defines the Multi-Agent System Routing (MASR) problem for the first time and proposes MasRouter, a cascade controller network. It sequentially determines the collaboration mode, role allocation, and LLM routing. While maintaining high performance, it reduces the inference cost of MAS by up to 52%, achieving an effective balance between performance and efficiency.
MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion: This paper proposes the MathFusion framework, which synthesizes mathematical problems pairwise into new challenges using three problem fusion strategies (sequential, parallel, and conditional fusion). With only 45K additional synthesized examples, it yields an average improvement of 18 percentage points in mathematical reasoning across multiple benchmarks.
Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes: Proposes MathNeuro, a computationally efficient method requiring only forward passes, which isolates parameters exclusive to mathematical reasoning in LLMs by filtering out parameters that are also important for general language tasks. Pruning these parameters removes mathematical capability, while scaling them enhances mathematical performance by 4-35%.
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following: This paper proposes the MDCure framework, which automatically constructs high-quality multi-document instruction data via a two-stage pipeline (generation and filtering). It trains MDCureRM, a multi-objective reward model, for data filtering. Fine-tuning LLMs (up to 70B) with this data yields performance improvements of up to 75.1% over baselines on multi-document and long-context tasks, demonstrating strong cross-task and cross-domain generalization capabilities.
Meaning Beyond Truth Conditions: Evaluating Discourse Level Understanding via Anaphora Accessibility: This paper proposes a three-level hierarchical system of natural language understanding capabilities (lexical/sentential/discourse), utilizing anaphora accessibility as a diagnostic task for discourse-level understanding. Through an evaluation dataset inspired by dynamic semantics, it systematically investigates LLMs' discourse understanding capabilities under three linguistic structures: universal quantifiers, negation, and disjunction.
Meaning Beyond Truth Conditions: Evaluating Discourse Level Understanding via Anaphora Accessibility: This paper proposes a hierarchical framework for semantic NLU capabilities (lexical, sentential, and discourse levels) and constructs an evaluation dataset based on anaphora accessibility. It finds that while LLMs align with humans on certain structures, they systematically diverge on others, relying on lexical cues rather than structural abstractions.
MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents: This work constructs MemBench, the first evaluation benchmark for LLM Agent memory capabilities that simultaneously covers two interaction scenarios (participation/observation), two memory levels (factual/reflective), and four evaluation metrics (accuracy/recall/capacity/efficiency). Evaluations across seven memory mechanisms demonstrate that simple RetrievalMemory performs best under large-scale memory conditions ($100\text{K}$ tokens) with an accuracy of $0.833$, while complex mechanisms (MemGPT, GenerativeAgent) fail to show advantages.
MEraser: An Effective Fingerprint Erasure Approach for Large Language Models: Proposes MEraser (Mismatched Eraser), which completely removes backdoor-based fingerprint watermarks in LLMs using fewer than 1,000 samples through a two-phase fine-tuning strategy (mismatched data erasure + clean data recovery) while preserving model performance, and pioneers transferable LoRA erasure adapters.
MergePrint: Merge-Resistant Fingerprints for Robust Black-box Ownership Verification of Large Language Models: This paper proposes MergePrint, the first black-box LLM fingerprint verification method tailored for model merging scenarios. It simulates merging behavior with a pseudo-merged model and employs a two-stage optimization (input optimization + parameter optimization) so that the embedded fingerprint remains detectable after merging, achieving efficient, harmless, and tamper-resistant ownership verification.
Meta-Reflection: A Feedback-Free Reflection Learning Framework: Proposes the Meta-Reflection framework, which stores and retrieves reflective insights through a learnable meta-reflection codebook. This enables LLMs to utilize historical reflective experience to improve output quality with only a single forward pass during inference, without requiring external feedback or multi-round iterations. Significant improvements are achieved across programming, mathematical reasoning, and e-commerce intent detection tasks.
MExGen: Multi-Level Explanations for Generative Language Models: The MExGen framework is proposed to map text outputs of generative models to real values via a scalarizer, perform multi-granularity linguistic segmentation, and apply linear-complexity attribution algorithms (C-LIME/L-SHAP). It provides more faithful input attribution explanations for context-driven text generation (summarization, QA) than PartitionSHAP and LLM self-explanations.
MHA2MLA: Towards Economical Inference by Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs: MHA2MLA proposes the first method to efficiently migrate pre-trained MHA models to DeepSeek's MLA architecture. By utilizing contribution-aware partial-RoPE removal and joint SVD low-rank approximation, the performance can be restored with only 0.6%-1% of training data, compressing the KV cache of Llama2-7B by 92.19% with only a 1% drop in LongBench performance.
Mind the (Belief) Gap: Group Identity in the World of LLMs: By simulating Belief Congruence theory within a multi-agent LLM framework, this work reveals that LLMs exhibit a stronger belief congruence bias than humans, which increases misinformation propagation and impairs learning capability. The authors propose three mitigation strategies based on social psychology.
Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy: This paper systematically investigates how prompt politeness levels influence LLM response accuracy. By constructing 250 multiple-choice prompts across 5 tone gradients (ranging from "Very Polite" to "Very Rude") and testing them on ChatGPT 4o, the authors counterintuitively find that rude prompts achieve significantly higher accuracy (84.8%) than polite prompts (80.8%).
MindRef: Mimicking Human Memory for Hierarchical Reference Retrieval with Fine-Grained Location Awareness: Proposes the MindRef framework, which mimics the human two-stage memory pattern of first recalling document titles and then locating specific passages. Through Trie and FM-Index constrained decoding, it enables LLMs to independently recall reference passages without the need for additional retrieval models or pre-chunking.
MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments: This paper proposes MIRAGE, an evaluation framework that systematically assesses the performance of LLMs in complex social interactive environments through eight carefully designed Murder Mystery game scenarios and four core metrics (Trust Inclination Index [TII], Clue Investigation Capability [CIC], Interaction Capability Index [ICI], and Script Compliance Index [SCI]). The findings reveal that even GPT-4 faces severe challenges in these scenarios.
Mitigate Position Bias in LLMs via Scaling a Single Hidden States Channel: This paper finds that specific channels in the hidden states of LLMs encode absolute position information (positional hidden states). Scaling this single channel mitigates the "lost in the middle" position bias, yielding up to a 15.2% improvement on multi-document QA benchmarks without affecting other capabilities of the model.
Mixture of Small and Large Models for Chinese Spelling Check: This paper proposes dynamically mixing the probability distributions of a small model (fine-tuned BERT) and a large language model (LLM) during the Beam Search decoding phase for Chinese spelling correction. Without fine-tuning the LLM, this approach balances the precise correction of the small model with the fluency of the LLM, achieving SOTA performance on multiple CSC datasets.
Mixtures of In-Context Learners: This paper proposes MoICL, which partitions the set of demonstrations in ICL into multiple subsets (experts) and blends their next-token distributions using a learnable weight function. This approach significantly improves the accuracy, robustness, and efficiency of ICL without modifying LLM parameters.
Comparing Moral Values in Western English-speaking Societies and LLMs with Word Associations: Proposes an LLM moral assessment framework based on word association rather than direct questioning, and constructs global moral networks (GMN) for humans and LLMs. The study finds high consistency between the two in positive moral dimensions, but shows that LLMs are systematically more abstract, less emotional, and less concrete on negative moral concepts.
MOSAIC: Multiple Observers Spotting AI Content: Based on the universal compression principle in information theory, this paper proposes MOSAIC, an AI-generated text detection method that ensembles multiple LLMs. By using the Blahut-Arimoto algorithm to compute optimal combination weights for multiple detector LLMs, it constructs a mixture distribution as an observer. It determines whether text is AI-generated by comparing the actual surprisal of the text with the expected cross-entropy difference of the mixture model, robustly outperforming single-model and two-model (such as Binoculars) approaches across multiple domains, languages, and generators.
Multi-Prompting Decoder Helps Better Language Understanding: The Multi-Prompting Decoder (MPD) framework is proposed, which queries pre-trained language models (PLMs) with multiple prompts to obtain multiple sets of hidden states and class scores. Combined with optimal transport matching and calibrated decoding strategies, it significantly outperforms existing methods on few-shot classification tasks in MaaS (Model-as-a-Service) scenarios.
Multi-Attribute Steering of Language Models via Targeted Intervention: This paper proposes MAT-Steer, which achieves precise, simultaneous inference-time intervention for multiple LLM attributes (e.g., truthfulness, toxicity, bias) via an attribute-aware token-level gating mechanism and orthogonality constraints, comprehensively outperforming existing ITI and fine-tuning methods on QA and generation tasks.
Which of These Best Describes Multiple Choice Evaluation with LLMs?: Systematically demonstrates that MCQA, as a standard evaluation format for LLMs, suffers from three major categories of flaws: (1) format flaws—inability to test generative or subjective tasks, mismatch with real-world LLM use cases, and failure to fully assess depth of knowledge; (2) dataset flaws—leakage, unanswerability, shortcuts, and saturation; and (3) model behavior flaws—poor robustness, option bias, and unfaithful explanations. Systematic remedies such as Constructed Response, Explanation MCQA, and IRT analysis are proposed by borrowing insights from psychometrics.
"My life is miserable, have to sign 500 autographs everyday": Exposing Humblebragging, the Brags in Disguise: This work introduces humblebragging detection to the field of computational linguistics for the first time, proposing a 4-tuple formal definition, constructing the HB-24 synthetic dataset, and conducting a comprehensive benchmark evaluation across ML/DL/LLM. GPT-4o achieves a 0.88 F1 under the zero-shot + definition setting, outperforming human annotators.
Natural Language Processing in Support of Evidence-based Medicine: A Scoping Review: This scoping review of 129 studies (2019-2024) follows the PRISMA guidelines, using the five-step EBM process (Ask-Acquire-Appraise-Apply-Assess) as an organizational framework to comprehensively survey the current application status, technological evolution pathways, and future directions of NLP technologies in evidence-based medicine.
NeKo: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts Language Model: This paper proposes NeKo, a multi-task post-recognition error correction language model based on a Tasks-Guided Mixture-of-Experts (MoE) architecture. NeKo achieves state-of-the-art (SOTA) performance across multiple cross-modality error correction tasks—including ASR, speech translation, and OCR—and outperforms GPT-3.5 and Claude-3.5 Sonnet in zero-shot scenarios.
Neural Topic Modeling with Large Language Models in the Loop: Proposes the LLM-ITL framework, which integrates LLMs in an "in-the-loop" manner into Neural Topic Model (NTM) training. Using an optimal transport-based topic alignment objective and a confidence weighting mechanism, it significantly improves topic interpretability while maintaining document representation quality and computational efficiency.
NewsInterview: a Dataset and a Playground to Evaluate LLMs' Grounding Gap via Informational Interviews: The authors construct a dataset of 40,000 news interview dialogues and find that LLMs fall short on acknowledgement (by over 50%) and topic switching (by 30%) in interview scenarios. They also design a simulated game environment with persuasion mechanisms (NewsInterview), demonstrating that even the best LLM (GPT-4o) can extract only 50.4% of the target information items.
Not Quite Sherlock Holmes: Language Model Predictions Do Not Reliably Differentiate Impossible from Improbable Events: Through meticulously designed minimal pair experiments, this paper reveals that language models cannot reliably differentiate "impossible events" from "improbable but possible events." Under adversarial conditions (where possible sentences contain unrelated words and impossible sentences contain related ones), all 35 tested models, including Llama 3, Gemma 2, and Mistral NeMo, perform below chance level.
Nudging: Inference-time Alignment of LLMs via Guided Decoding: This paper proposes Nudging, a training-free inference-time alignment algorithm. It uses a small aligned model to inject a small number of "nudging tokens" that guide the output when the base model is uncertain, matching or even surpassing the performance of large aligned models with an aligned model that is 7-14 times smaller.
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens: This work presents OLMoTrace, the first system capable of tracing language model outputs verbatim back to their complete multi-trillion-token training data in real time (average 4.5 seconds). Based on an extended infini-gram engine with suffix array indexing, the system achieves highly efficient and precise matching, supporting application scenarios such as fact-checking, creative attribution, and mathematical capability tracing.
On Entity Identification in Language Models: This paper proposes a clustering-based evaluation framework (Purity/Inverse Purity) to analyze the entity identification capabilities in LLM internal representations. It finds that entity information becomes linearly separable ($F_1 \approx 0.9$) within a 20-dimensional subspace in early layers (~ normalized position 0.2), and different LLMs converge to structurally isomorphic entity encodings. This provides systematic evidence for the "emergence of discrete knowledge structures in LLMs from raw text training."
On the Acquisition of Shared Grammatical Representations in Bilingual Language Models: By training small, controlled bilingual language models, this paper investigates the mechanisms of shared cross-lingual grammatical representations using the structural priming paradigm from psycholinguistics. The study finds that the cross-lingual structural priming effect is asymmetric across language pairs and significantly weaker for typologically more distant language pairs (e.g., English-Greek).
On the Mutual Influence of Gender and Occupation in LLM Representations: By approximating the gender direction in the LLM embedding space, this study systematically investigates the bidirectional influence between the gender representation of first names and occupational contexts: occupational contexts shift the gender representation of names, while the gender representation of names in turn affects the biased behaviors of LLMs in occupation prediction tasks, though the correlation between the two is only moderate.
On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs: This paper systematically investigates the risk of "evidence pollution" in malicious social text detection in the LLM era. It proposes 13 pollution methods and 3 defense strategies, finding that LLM-generated fake evidence can cause detector performance degradation by up to 14.4%, and existing defense strategies face practical deployment challenges.
One for All: Update Parameterized Knowledge Across Multiple Models with Once Edit: Proposes OnceEdit, which updates knowledge across multiple LLMs with a single edit: it edits a lightweight plug-in model and then uses heterogeneous model ensemble techniques to transfer the edited knowledge. It significantly outperforms existing methods on the ZsRE and Counterfact datasets.
Open-Set Living Need Prediction with Large Language Models: This work proposes the PIGEON system, which reformulates user need prediction on life service platforms from a closed-set classification problem into an open-set generation problem. It leverages GNN-based behavior embeddings to retrieve historical records to assist LLM prediction, refines predictions guided by Maslow's hierarchy of needs, and fine-tunes a text embedding model to retrieve services from flexible needs, achieving an average improvement of 19.37% on real Meituan data.
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models: Proposes OpenCoder, a fully open-source code large language model (including 1.5B and 8B versions) that not only achieves top-tier performance but also serves as an "open cookbook" by releasing reproducible data processing pipelines, pretraining datasets, ablation studies, and training protocols, providing foundational infrastructure for research in code intelligence.
P3: Prompts Promote Prompting: This paper proposes the P3 framework, which is the first to optimize both system prompts and user prompts simultaneously. High-quality prompt templates are generated through offline iterative optimization, which are then utilized for online query-dependent prompt optimization. This approach outperforms methods that optimize only a single prompt side on both general and reasoning tasks, such as Arena-Hard, AlpacaEval, GSM8K, and GPQA.
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs: The Palm dataset, constructed over a year through a community-driven effort by 44 researchers from across the Arab world, covers all 22 Arab countries, 20 cultural themes, and 10 dialects. It comprises 17,411 human-created instruction-response pairs for evaluating and improving the Arabic cultural and dialectal capabilities of LLMs.
Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub: This paper proposes the OpenAgent system, which autonomously searches, configures, applies, and stores GitHub repositories as tools through a four-stage process of Search→Setup→Apply→Store, successfully solving open-domain tasks for LLMs in specialized areas like finance, chemistry, and biology, with an average success rate of 69.4%.
Past Meets Present: Creating Historical Analogy with Large Language Models: This paper defines the "historical analogy acquisition" task for the first time, systematically explores LLM-based retrieval and generation methods, and proposes a self-reflection mechanism to mitigate hallucinations and stereotype issues in LLM-generated historical analogies. The potential of LLMs in historical analogy is validated through human and automatic multidimensional evaluations.
Personalized Generation In Large Model Era: A Survey: The first systematic survey on cross-modal Personalized Generation (PGen), presenting a unified user-centric perspective to integrate research from NLP, CV, and IR communities under a single framework, covering six modalities: text, image, video, audio, 3D, and cross-modality.
Perspective Transition of Large Language Models for Solving Subjective Tasks: Proposes RPT (Reasoning through Perspective Transition), which allows LLMs to sequentially explore direct, role-playing, and third-person perspectives within a single prompt, rank them by confidence, and perform reasoning with the optimal perspective. It consistently outperforms fixed perspective and ensemble baselines across 12 subjective tasks and 4 models (GPT-4/GPT-3.5/Llama-3/Qwen-2), achieving an average improvement of +4.56 points on GPT-3.5.
Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models: Through the redefinition of physical/mathematical constants and measurement units (e.g., "let $\pi=500$"), this paper systematically investigates the performance of LLMs on inverse scaling tasks. The study reveals that larger models are more prone to anchoring to pre-existing memorized values and failing to follow prompt-based redefinitions, with incorrect confidence (giving wrong answers instead of abstaining) also increasing with scale.
PlanGenLLMs: A Modern Survey of LLM Planning Capabilities: This is the first survey on LLM planning capabilities that introduces a six-dimensional evaluation framework (completeness, executability, optimality, representation, generalization, efficiency) based on classical planning theory (Kartam & Wilkins 1990). It systematically reviews foundational paradigms ranging from task decomposition to search algorithms, while identifying key unresolved directions such as multi-agent planning, hallucinations, and alignment with human preferences.
Planning-Driven Programming: A Large Language Model Programming Workflow: This work proposes LPW (LLM Programming Workflow), a two-phase workflow integrating "solution generation → plan verification → code implementation → precise debugging based on plan verification." LPW significantly improves LLM code generation accuracy, achieving new SOTA results on GPT-4o with 98.2% on HumanEval, 84.8% on MBPP, and 59.3% on LiveCode.
PiFi: Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models: Proposes the PiFi framework, which inserts a single frozen layer of an LLM into an SLM and fine-tunes the combined model, significantly boosting SLM performance on NLU and NLG tasks with minimal computational overhead.
KoGEM: Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean: Proposes KoGEM (Korean Grammar Evaluation Benchmark), which contains 1,524 multiple-choice questions based on theoretical linguistics classification, covering 16 subcategories across 5 major domains: phonology, morphology, syntax, semantics, and prescriptive grammar. It evaluates 27 LLMs under a zero-shot setting and compares them with humans, revealing that LLMs perform significantly worse than humans on linguistic subcategories requiring experiential knowledge (e.g., pronunciation rules and phonological changes), while explicitly supplementing experiential knowledge (e.g., pronunciation text, morphemic decomposition) can lead to substantial improvements.
Only a Little to the Left: A Theory-grounded Measure of Political Bias in LLMs: This paper replaces the unscientific Political Compass Test (PCT) with the validated World Values Survey (WVS) from political science. Using 30 prompt variations across 11 open-source and commercial LLMs, the authors collect 88,110 open-ended responses and train a stance classifier for automated annotation. The study finds that instruction-tuned models generally lean left, but bias measurements are highly sensitive to prompts, and the PCT exaggerates the political bias of specific models (e.g., GPT-3.5).
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges: A systematic survey of resources in 58 papers evaluating the pragmatic abilities of NLP models. It categorizes them by pragmatic phenomena (context/deixis, implicature/presupposition, speech acts, discourse coherence, social pragmatics), categorizes task designs (MCQ/QA/NLI/reference games, etc.) and data construction methods (bottom-up/top-down), and reveals core gaps in current evaluations (English-centric bias, unimodal limitations, lack of fine-grained evaluation), providing a roadmap for pragmatic evaluation in the LLM era.
PRAISE: Enhancing Product Descriptions with LLM-Driven Structured Insights: This paper proposes PRAISE, a 4-step LLM pipeline (Attribute Extraction → Cross-Product Comparison → Semantic Grouping → Structured Presentation) that automatically generates structured insights from Amazon product descriptions using Gemini 2.0 Flash. Validated on 90 products across 9 categories, the multi-step pipeline significantly outperforms single-shot generation. Extraction quality is highly correlated with product subjectivity (Arts & Crafts F1=0.82 vs. Books F1=0.36), and the pipeline requires only $2R+1$ API calls per product.
Pre³: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation: Proposes Pre³, which transforms LR(1) grammars into Deterministic Pushdown Automata (DPDA). By precomputing prefix-conditioned edges to eliminate runtime non-deterministic exploration, it significantly accelerates structured LLM generation, reducing per-token latency by up to 40% and improving throughput by up to 36%.
Prediction Hubs are Context-Informed Frequent Tokens in LLMs: This paper presents the first systematic analysis of the hubness phenomenon in autoregressive LLMs. It theoretically proves that the probability distance used in LLM prediction is unaffected by the distance concentration effect. Empirically, it finds that prediction hubs are context-modulated high-frequency tokens (constituting "benign hubs"), whereas using Euclidean distance to compare LLM representations leads to harmful nuisance hubs.
Substance over Style: Evaluating Proactive Conversational Coaching Agents: Through health coaching expert interviews and a user study (31 participants, 155 conversations), this study systematically evaluates LLM coaching agents across five different conversational styles (e.g., Directive, Interrogative, Facilitative). The findings show that users highly value core functionality (substance) and hold negative attitudes toward stylistic embellishments (style) when substance is lacking, while also revealing significant inconsistencies between first-person user evaluations and third-party expert/LLM evaluations.
Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning: A two-stage framework is proposed: first aggregating continuous moral ratings from multiple LLMs into collective consensus probabilities using a truncated normal distribution EM algorithm, and then optimizing token-level embeddings representing ethical theories of outlier models to align them with the collective consensus, achieving coherent moral reasoning across multiple LLMs.
Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning: Proposes a curriculum In-Context Learning (ICL) strategy guided by Problem-Solving Logic. It selects and orders demonstration examples by analyzing the step-by-step structure of problem-solving, which effectively enhances the complex reasoning capabilities of LLMs.
ProgCo: Program Helps Self-Correction of Large Language Models: ProgCo proposes using LLMs to automatically generate and execute verification pseudoprograms (ProgVe) to check the correctness of their own answers, and then utilizes a dual reflection and correction mechanism (ProgRe) on both the answers and the verification programs to achieve reliable self-correction. This significantly improves correction success rates on instruction-following and mathematical reasoning tasks.
Psycholinguistic Word Features: A New Approach for the Evaluation of LLMs Alignment with Humans: This paper systematically proposes the use of psycholinguistic word norms (Glasgow: 5,553 words $\times$ 7 features + Lancaster: 39,707 words $\times$ 6 sensory modalities, totaling 13 lexical features) to evaluate the alignment between LLMs and humans. The study finds that while GPT-4o shows a relatively high correlation on Glasgow emotional/conceptual features, all models perform extremely poorly on Lancaster sensorimotor features, quantitatively revealing the fundamental limitation of LLMs lacking embodied cognition.
Aligning Large Language Models with Implicit Preferences from User-Generated Content: Proposes the PUGC framework, which leverages implicit human preferences in unlabeled User-Generated Content (UGC) to generate preference data. By converting UGC into queries and reference texts, the framework scores model-generated responses and employs DPO to achieve scalable, domain-specific alignment, reaching a state-of-the-art length-controlled win rate of 35.93% on AlpacaEval 2 based on Mistral-7B.
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation: QG-SMS proposes simulating student populations with varying comprehension levels using a single LLM. Through a three-step workflow of student profile generation, performance prediction, and analysis, it addresses the severe limitations of existing LLM evaluators in post-test analysis dimensions (item difficulty, discrimination, and distractor efficiency), achieving the highest consistency accuracy across multiple datasets.
QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning: This paper proposes QualiSpeech, the first speech quality assessment dataset featuring 11-dimensional annotations and detailed natural language reasoning descriptions, along with an accompanying evaluation benchmark. It demonstrates that fine-tuned audio LLMs can generate detailed descriptions of noise and distortion, and highlights the potential of reasoning-enhanced quality assessment.
Quantifying Semantic Emergence in Language Models: Proposes Information Emergence (IE), an information-theoretic metric that quantifies the ability of LLMs to extract semantics from tokens by comparing macro (sequence-level) and micro (token-level) mutual information across Transformer layers.
Rank, Chunk, and Expand: Lineage-Oriented Reasoning for Taxonomy Expansion: LORex proposes a plug-and-play taxonomy expansion framework that combines the discriminative ranker TEMPORA (taxonomic path verbalization based on Euler paths) and iterative LLM reasoning (semantic filtering $\rightarrow$ parent retrieval $\rightarrow$ path validation). Without fine-tuning LLMs, it achieves a 12% accuracy gain and a 5% Wu&P gain across four benchmarks.
Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat: This study systematically evaluates the performance of four ranking algorithms (Elo, Bradley-Terry, Glicko, Markov Chain) in head-to-head LLM evaluations. By defining three core ranking criteria (transitivity, prediction accuracy, and hyperparameter sensitivity), the authors reveal that the widely used Elo rating system suffers from severe deficiencies in stability and consistency, and recommend Glicko for large, uneven datasets and Bradley-Terry for small, controlled datasets.
Re-TASK: Revisiting LLM Tasks from Capability, Skill, and Knowledge Perspectives: Drawing on Bloom's Taxonomy and Knowledge Space Theory, this paper proposes the Re-TASK framework to revisit LLM tasks from a three-layer perspective of "capability item - skill - knowledge." It designs the Re-TASK prompting strategy to enhance Chain-of-Thought (CoT) performance on domain tasks through targeted knowledge injection and skill adaptation, achieving up to a 45% improvement on legal tasks.
Reason from Future: Reverse Thought Chain Enhances LLM Reasoning: Proposes the Reason from Future (RFF) reasoning paradigm, which achieves bidirectional reasoning by alternating between reverse reasoning (decomposing backward from the goal) and forward reasoning (approaching the goal from the current state). It significantly outperforms methods like CoT, ToT, and CR on benchmarks including Game of 24, GSM8K, and MATH-500, while substantially reducing the search space.
Recent Advances in Speech Language Models: A Survey: The first comprehensive survey on Speech Language Models (SpeechLMs), systematically tracing the evolution from "ASR+LLM+TTS" cascaded architectures to end-to-end speech language models. It proposes a taxonomy categorized by three key components (speech tokenizer / language model / vocoder) and training strategies, and covers downstream capabilities, evaluation metrics, challenges, and future directions.
Reconsidering LLM Uncertainty Estimation Methods in the Wild: This paper systematically investigates four major challenges (threshold selection sensitivity, query transformation robustness, applicability to long-form text generation, and multi-score ensemble strategies) faced by 19 LLM uncertainty estimation methods during practical deployment, revealing significant limitations of existing methods in real-world scenarios and proposing ensemble strategies as a practical direction for improvement.
Recurrent Knowledge Identification and Fusion for Language Model Continual Learning: Proposes the Recurrent-KIF continual learning framework, which dynamically estimates parameter importance distribution via an inner-outer loop iterative mechanism and utilizes importance-based binary masks for knowledge fusion, effectively mitigating catastrophic forgetting and promoting knowledge transfer.
Red-Teaming LLM Multi-Agent Systems via Communication Attacks: This work proposes the Agent-in-the-Middle (AiTM) attack, which intercepts and tampers with the communication messages between agents in LLM multi-agent systems (rather than directly modifying the agents themselves). By utilizing an adversarial agent equipped with a reflection mechanism to generate context-aware malicious instructions, AiTM achieves an attack success rate of 40% to 100% across various frameworks, communication structures, and real-world applications.
Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models: This work presents the first systematic evaluation of 8 salience measures in Sparse Parameter-Efficient Fine-Tuning (SPEFT), discovering that simple gradient-based methods combined with static masks consistently outperform LoRA, challenging the common belief that PEFT requires complex designs.
Identifying Reliable Evaluation Metrics for Scientific Text Revision: This study systematically analyzes the limitations of traditional similarity metrics (such as ROUGE and BERTScore) in evaluating scientific text revisions, revealing that they strongly correlate with edit distance and penalize deep modifications. To address this, a hybrid evaluation method combining LLM-as-Judge with task-specific, cross-domain metrics is proposed, which significantly outperforms any single metric in aligning with human judgment.
Representation Bending for Large Language Model Safety: Proposes RepBend, which integrates the core concept of activation steering (the vector difference between safe and unsafe representations) into the loss function design of LoRA fine-tuning. By "bending" the representation space of the model, it separates safe and unsafe states in the latent space, achieving up to a 95% reduction in Attack Success Rate (ASR) across various jailbreak benchmarks while maintaining minimal impact on general capabilities.
Representations of Fact, Fiction and Forecast in Large Language Models: Epistemics and Attitudes: By evaluating the semantic knowledge of epistemic modality (such as may/must, know/believe/doubt) in 8 open-source LLMs through a controlled storyboard task, this paper reveals that LLMs exhibit limited and non-robust capabilities in generating appropriate epistemic expressions—necessity (must) consistently outperforms possibility (may), and factual statements outperform belief statements.
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation: A unified framework, RetroLLM, is proposed to integrate retrieval and generation into a single autoregressive decoding process. Through hierarchical FM-Index constraints and forward-looking constrained decoding, the LLM is enabled to directly generate fine-grained evidence from the corpus while significantly reducing token consumption.
Retrospective Learning from Interactions: Proposes the ReSpect method, which enables multimodal LLMs to self-improve by retrospectively decoding users' implicit feedback signals in multi-turn interactions without any external annotations, improving the task completion rate from 31% to 82% over thousands of human-machine interactions.
Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up: Proposes Reversal of Thought (RoT), a plug-and-play reasoning framework. Through a preference-guided reverse reasoning warm-up strategy, RoT enables LLMs to back-generate "LLM-flavored" optimal prompts from examples, and utilizes a Cognitive Preference Manager to automatically distinguish between known and unknown tasks, outperforming baselines like CoT, ToT, and GoT on multiple reasoning tasks.
Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?: This paper defines the concept of "marker confidence" to measure the actual accuracy when LLMs utilize epistemic markers (e.g., "fairly certain"). Through systematic experiments on 7 models and 7 datasets, the authors find that epistemic markers show stable performance in in-distribution scenarios but are highly unreliable under out-of-distribution conditions.
Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results: This paper reveals a severe length bias issue in the evaluation of language model uncertainty quantification (UQ). Both UQ methods and correctness evaluation functions are affected by response length bias, and their "mutual bias" systematically distorts AUROC rankings. This is demonstrated both theoretically and empirically, while LLM-as-a-judge is found to be the evaluation alternative least affected by length bias.
RiOT: Efficient Prompt Refinement with Residual Optimization Tree: The authors propose Residual Optimization Tree (RiOT), an automatic prompt optimization framework that manages the optimization process through a tree structure, enhances diversity via perplexity-based node selection, and mitigates semantic drift with text residual connections.
Robust Utility-Preserving Text Anonymization Based on Large Language Models: This paper proposes the RUPTA framework, in which three LLM components (a privacy evaluator, a utility evaluator, and an optimizer) work collaboratively to iteratively edit text, defending against LLM re-identification attacks while preserving downstream task utility. The anonymization capability is then transferred to lightweight models via DPO distillation.
RoCoFT: Efficient Finetuning of Large Language Models with Row-Column Updates: Proposes RoCoFT, an extremely simple parameter-efficient fine-tuning method that updates only a small subset of rows or columns in the Transformer weight matrices. It achieves accuracy comparable to state-of-the-art PEFT methods like LoRA on tasks such as GLUE, QA, summarization, and commonsense/mathematical reasoning, while reducing memory and computation overhead. The method's effectiveness is explained theoretically via Neural Tangent Kernel (NTK) theory.
(RSA)²: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding: This paper proposes the (RSA)² framework, which explicitly models the speaker's rhetorical strategies (e.g., irony, hyperbole) within the probabilistic pragmatics RSA framework. This enables LLMs to correctly understand non-literal meanings without modeling the speaker's motivation, achieving SOTA performance on the irony comprehension dataset PragMega+.
Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE Dataset: This paper proposes Rubrik, an explanation quality evaluation rubric inspired by educational assessment and based on a three-tier nested typology ($\text{Commentary} \subseteq \text{Justification} \subseteq \text{Argument}$) and 8 quality dimensions. Applying it to the CUBE dataset (containing 26K explanations generated by humans and 6 LLMs), the authors find that the primary cause of low-quality LLM explanations is a lack of conciseness rather than coherence.
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts: This paper systematically assesses the safety robustness of 11 LLM judges, showing that superficial artifacts like apology prefixes distort preferences by up to 98%. A proposed jury-based multi-model aggregation helps but does not resolve the issue.
Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition: This paper proposes a highly sample-efficient human evaluation method based on the Maximum Discrepancy (MAD) competition principle. By automatically selecting a subset of instructions that best distinguish the performance differences between LLMs, it significantly reduces the human annotation workload, recovering stable model rankings from large-scale evaluations with only 280 comparisons.
Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs: This paper conducts the first systematic evaluation of the impact of role-play fine-tuning on the safety of LLMs, finding that the level of safety degradation is positively correlated with role traits (especially villainous roles). It proposes the SaRFT framework, which adaptively identifies subsets of harmful training data for different roles using an implicit reward function, and combines this with KL-divergence regularization to achieve a Pareto-optimal balance between role expressiveness and safety.
SConU: Selective Conformal Uncertainty in Large Language Models: SConU introduces significance testing into the conformal uncertainty framework of LLMs for the first time. By constructing two types of conformal p-values, it identifies and filters out uncertainty data outliers that violate the exchangeability assumption, thereby achieving strict control over the miscoverage rate in both single-domain and cross-domain QA scenarios.
SCoP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View: From a cognitive science perspective, SCoP decomposes the document comprehension process of LLMs into five progressive skills (Locating, Inferring, Connecting, Organizing, and Selecting). It constructs a test set containing 4,682 samples to evaluate the comprehension "process" rather than just the "answers." The study reveals that LLMs generally perform significantly better in local comprehension (~94%) than in global comprehension (~31%), and that their comprehension processes can be flawed even when the final answers are correct.
SCULPT: Systematic Tuning of Long Prompts: This paper proposes the SCULPT framework, which models long prompt optimization as an iterative modification problem on a hierarchical tree structure. Through a Critic-Actor framework, it conducts structured reflection and operation-level modifications on the prompt, significantly improving LLM task performance while preserving the information in long prompts and remaining robust to adversarial perturbations.
SDD: Self-Degraded Defense against Malicious Fine-tuning: SDD achieves defense by training LLMs to generate high-quality but irrelevant benign responses to harmful instructions: when an attacker performs malicious fine-tuning, the model's general capability significantly degrades, rendering it unable to effectively execute malicious instructions.
SEE: Strategic Exploration and Exploitation for Cohesive In-Context Prompt Optimization: SEE is the first prompt optimization framework that jointly optimizes instructions and examples as a cohesive whole. It designs a four-phase exploration-exploitation strategy based on metaheuristic optimization principles, coupled with the adaptive selection of five LLM operators, significantly outperforming 9 state-of-the-art (SOTA) methods across 35 benchmark tasks.
Stepwise Reasoning Disruption Attack of LLMs: A stepwise reasoning error disruption (SEED) attack method is proposed, where subtle errors (e.g., slightly altered calculation numbers) are strategically injected into the early steps of an LLM's reasoning chain. This forces the model to naturally propagate the error in subsequent reasoning steps and output incorrect answers. It is compatible with both zero-shot and few-shot settings, achieves a detection rate as low as 0.8% on GPT-4o, and reveals a severe security vulnerability in the stepwise reasoning processes of LLMs.
Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models: This paper proposes Segment-Level Diffusion (SLD), which segments long-form text outputs into multiple segments (such as sentences or dialogue turns) and models the latent representations of each segment using diffusion. Combined with contrastive learning and adversarial training to enhance representation robustness, SLD achieves superior long-form generation quality compared to existing diffusion models on tasks such as summarization, story generation, and dialogue generation.
Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation: This paper proposes Self-Foveate, an approach inspired by the human visual foveation mechanism. Through a three-level foveation strategy ("micro-scatter-macro"), it systematically extracts multi-granularity information from unsupervised text to synthesize instruction data with higher diversity and difficulty for instruction tuning of LLMs.
Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs: A self-instructed reinforcement learning framework is proposed to train a "derived prompt generation model." By utilizing the derived prompt-response pairs as in-context learning (ICL) examples to enhance queries of the original prompt, the response quality is significantly improved without modifying the parameters of black-box LLMs (such as GPT-4).
Self-Training Elicits Concise Reasoning in Large Language Models: Discovers that LLM output distributions naturally contain concise reasoning paths and proposes the FS-BoN (Few-Shot conditioning + Best-of-N sampling) self-training framework. By filtering short and correct reasoning samples from the model's own distribution for fine-tuning, the method achieves an average of 30% token reduction across five model families on GSM8K and MATH without sacrificing accuracy, delivering 2.4 times the efficiency of the prior method, Rational Metareasoning.
Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching: Inspired by the Feynman technique, a Self-Tuning framework is proposed. Through a three-layer self-teaching strategy of memorization, comprehension, and self-reflection, it significantly enhances the ability of LLMs to effectively acquire and recall knowledge from new documents.
SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence: SelfElicit discovers that the attention scores in deep layers of LLMs naturally possess the capability to localize key evidence within the context (even when the model generates incorrect answers). Based on this finding, an inference-time context enhancement method is proposed: by generating only one extra token, it automatically identifies and highlights key evidence sentences to guide the model toward generating more accurate answers.
Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models: To address two major sources of waste in LLM tree-search reasoning—"performing complex searches for simple problems" and "repeatedly expanding semantically identical paths"—this paper proposes the SEAG framework. It first uses an entropy-based gating mechanism to decide whether to activate tree search, and then merges equivalent reasoning steps via semantic clustering. Ultimately, SEAG achieves an average accuracy improvement of 4.3% while requiring only 31% of the inference overhead of RAP.
Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents: To address the difficulty of LLMs in simulating the erroneous behaviors of low-performing students, this paper proposes a training-free framework based on knowledge graph cognitive prototypes. By utilizing a three-stage process—cognitive state modeling $\to$ behavior prediction $\to$ beam search self-refinement—the framework generates realistic student responses, achieving a 100% improvement in simulation accuracy on the Student_100 dataset.
SkillAggregation: Reference-free LLM-Dependent Aggregation: This paper proposes SkillAggregation, a method that learns context-dependent skill weights of LLM judges and performs inference using posterior estimation. It effectively aggregates the predictions of multiple LLM judges without reference labels, outperforming existing aggregation methods across multiple tasks.
SkillVerse: Assessing and Enhancing LLMs with Tree Evaluation: SkillVerse is proposed as an unsupervised, tree-structured LLM diagnostic framework. By organizing evaluation feedback from an LLM-as-Judge into a hierarchical skill tree (dendrogram), it uncovers the strengths and weaknesses of model capabilities at any level of granularity. This is further utilized to select superior few-shot exemplars (improving ICL by up to 25%) and to predict model weaknesses in unseen scenarios (achieving a 55% success rate, which is 22% higher than the uninformed baseline).
SocialEval: Evaluating Social Intelligence of Large Language Models: Proposes SocialEval, a bilingual social intelligence benchmark based on narrative scripts. By manually constructing 153 "World Trees" that model social interactions as goal-conditioned MDPs, it integrates outcome-oriented Goal Achievement Evaluation (GAE) and process-oriented Interpersonal Ability Evaluation (IAE) to systematically evaluate the social intelligence of LLMs in multi-turn social scenarios and their gaps with humans.
SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition: SongComposer is the first music-specific Large Language Model capable of simultaneously generating lyrics and melodies. Utilizing a word-level aligned tuple format, a music-knowledge-based scalar pitch initialization, and progressive structure-aware training (motif → independent song → phrase-level pairing), it comprehensively outperforms GPT-4 on tasks including lyric-to-melody, melody-to-lyric, song continuation, and text-to-song generation.
SQLong: Enhanced NL2SQL for Longer Contexts with LLMs: This paper proposes SQLong, a data augmentation framework for NL2SQL in long-context scenarios. By injecting synthetic CREATE TABLE statements sampled from other databases into the training data to extend the context length, it enables fine-tuned LLMs to achieve significantly improved SQL generation accuracy in large-scale schema scenarios.
SR-LLM: Rethinking the Structured Representation in Large Language Model: This work proposes the SR-LLM framework, which effectively integrates structured representations (AMR, PST, FOL) into LLMs through two settings: training-free natural language description conversion and training-dependent hybrid-data fine-tuning. It achieves performance improvements of 3.17% and 12.38% respectively on downstream tasks like PAWS, providing the first substantial evidence that structured representations can enhance the reasoning capabilities of LLMs.
Steering off Course: Reliability Challenges in Steering Language Models: This paper systematically evaluates the generalization of three mainstream language model steering methods (DoLa, Function Vectors, Task Vectors) across up to 36 models, revealing severe fragility and high variance issues, as well as fundamental flaws in their underlying assumptions.
STEM-PoM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing: This paper proposes the STEM-PoM benchmark dataset (2K+ math-symbol instances), combining Part-of-Math Tagging with document parsing to systematically evaluate LLMs' capacity to classify contextual polysemy in mathematical symbols. It demonstrates that improvements in symbol classification can be transferred to enhance downstream mathematical reasoning performance.
Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors: Fine-tuning LLMs via DPO aligns their writing style with the linguistic feature distribution of human text. The resulting machine-generated text (MGT) is significantly harder to detect, which exposes the over-reliance of existing MGT detectors on shallow linguistic cues.
Structural Reasoning Improves Molecular Understanding of LLM: This paper proposes the Molecular Structural Reasoning (MSR) framework. By explicitly incorporating six key structural elements of molecules (molecular formula, longest carbon chain, aromatic rings, ring compounds, functional groups, and chiral centers) as intermediate reasoning steps, it significantly improves LLM performance on molecular understanding tasks.
SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment: SudoLM proposes an LLM parametric knowledge access control framework. Through a "SUDO key" mechanism, it allows authorized users to unlock restricted knowledge (e.g., medical domain knowledge) while unauthorized users can only access public knowledge. By utilizing authorization alignment via DPO, it achieves hierarchical access control within a single model—a task that traditionally required multiple model versions.
SynapticRAG: Enhancing Temporal Memory Retrieval in Large Language Models through Synaptic Mechanisms: Proposes SynapticRAG, which draws inspiration from synaptic transmission and the Leaky Integrate-and-Fire (LIF) model in neuroscience. By fusing temporal association triggers with semantic similarity, it achieves up to a 14.66% improvement over the state-of-the-art (SOTA) on conversational memory retrieval tasks.
Synergizing Unsupervised Episode Detection with LLMs for Large-Scale News Events: This paper proposes EpiMine, an unsupervised episode detection framework that detects episodes (sub-event segments) under key events from news corpora by synergizing discriminative term co-occurrence-driven article segmentation and LLMs, achieving an average improvement of 59.2% across three real-world datasets.
Systematic Generalization in Language Models Scales with Information Entropy: Demonstrates that the systematic generalization ability of language models is positively correlated with the information entropy of constituent distributions in training data—standard seq2seq models without built-in compositional priors can achieve strong systematic generalization under high-entropy training distributions.
T5Score: A Methodology for Automatically Assessing the Quality of LLM Generated Multi-Document Topic Sets: The authors propose T5Score, a methodology that decomposes the quality of LLM-generated free-text topic sets (FT-topics) into five quantifiable dimensions (interpretability, topic coverage, document coverage, non-overlap, inner-order). This approach achieves high inter-annotator agreement through simple labeling tasks and validates that LLMs can serve as automated evaluators to replace human effort.
TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models: TableLoRA proposes a specialized LoRA module for table tasks, improving table serialization through a special token encoder and encoding cell row/column positional information with 2D LoRA. Under parameter-efficient fine-tuning (PEFT) settings, it achieves a 5.9% improvement on HiTab compared to vanilla LoRA, bridging 40.56% of the performance gap between LoRA and full fine-tuning.
TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora: Proposes the TaxoAdapt framework. By utilizing hierarchical classification-driven depth/breadth expansion and taxonomy-aware clustering, TaxoAdapt dynamically aligns LLM-generated multidimensional taxonomies with specific scientific corpora, outperforming state-of-the-art baselines by 26.51% in path granularity and 50.41% in sibling coherence.
Team Anotheroption at SemEval-2025 Task 8: Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA: Proposes a multi-model collaborative pipeline system for Table QA that integrates three paths: Text-to-SQL, Text-to-Code (Pandas), and end-to-end semantic understanding. By leveraging RAG to retrieve and enhance contexts, and employing Llama 3.3-70B as an Orchestrator to adjudicate the final answer, the system ranked 13th out of 38 with 80% accuracy in the SemEval-2025 Task 8 open-source track. On the development set, the open-source combination (88%) significantly outperformed the single GPT-4o model (74%).
Temporal Reasoning for Timeline Summarisation in Social Media: This paper proposes enhancing the temporal reasoning capabilities of LLMs by constructing a new narrative temporal reasoning dataset, NarrativeReason. It transfers temporal reasoning knowledge to smaller models via a knowledge distillation framework, while training them to perform timeline summarisation. This approach achieves state-of-the-art performance and significantly reduces hallucinations in cross-domain mental health summarisation tasks.
TESS 2: A Large-Scale Generalist Diffusion Language Model: TESS 2 is proposed as the first large-scale generalist instruction-following diffusion language model adapted from an existing autoregressive model. Through an adaptation training scheme involving UL2 masking + label shifting + bidirectional attention, combined with reward guidance during inference, it matches or even outperforms equivalent AR models on QA and instruction-following tasks.
TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure: This work proposes TestCase-Eval, a benchmark containing 500 Codeforces competitive programming problems and 100,000 human code submissions. Through two tasks, Fault Coverage and Fault Exposure, this benchmark systematically evaluates the ability of 19 LLMs in test case generation for algorithmic problems. The findings reveal that the strongest model, Qwen3-32B, achieves an exposure rate of only 43.8%, which is far below the 93.3% of human experts.
TestNUC: Enhancing Test-Time Computing Approaches and Scaling through Neighboring Unlabeled Data Consistency: TestNUC proposes a linearly scaling test-time inference enhancement method. By retrieving nearest neighbor unlabeled samples for a test instance, it prompts LLMs to predict both the test sample and its neighbors, then aggregates these predictions via weighted majority voting to consistently improve classification accuracy.
The AI Gap: How Socioeconomic Status Affects Language Technology Interactions: Through a large-scale survey of 1,000 users of different socioeconomic status (SES) and an analysis of 6,482 real LLM prompts, this study reveals significant, systematic differences between high- and low-SES groups in terms of language technology usage frequency, interaction styles, and topic choices, calling for the development of more inclusive NLP technologies to narrow the AI gap.
The Nature of NLP: Analyzing Contributions in NLP Papers: Proposes a taxonomy of NLP paper contributions (Knowledge/Artifact $\times$ 8 subcategories), builds the manually annotated dataset NLPContributions ($\approx 2\text{k}$ papers), trains SciBERT to automatically identify contribution statements, and conducts a 50-year longitudinal trend analysis on $\approx 29\text{k}$ ACL Anthology papers, revealing the evolutionary trajectory of NLP research from being linguistics-oriented to method/model-dominant, and recently returning to concerns about humanity and language.
The Role of Deductive and Inductive Reasoning in Large Language Models: This paper proposes the DID (De-In-Ductive) framework to enhance the reasoning capabilities of LLMs by dynamically combining deductive and inductive reasoning. It utilizes a dual-indicator complexity evaluation system consisting of Littlestone dimension and information entropy to guide the question decomposition strategy. It achieves a 70.3% accuracy rate on the AIW benchmark (outperforming ToT's 62.2%) while maintaining lower computational costs.
Theorem Prover as a Judge for Synthetic Data Generation: This paper proposes the TP-as-a-Judge framework, which leverages the Lean theorem prover to verify intermediate reasoning steps generated by LLMs. Combined with iterative auto-formalization and Reinforcement Learning from Theorem Prover Feedback (RLTPF), this work achieves significant improvements on multiple mathematical reasoning benchmarks using only 3,508 samples.
A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive: Proposes and validates a theory of response sampling in LLMs, demonstrating that the sampling process is driven by two components at once: descriptive forces (statistical norms) and prescriptive forces (implicit ideals). As a result, samples systematically skew away from statistical averages toward idealized values. This bias is statistically significant across 15 models and 500 concepts, and grows stronger as model size increases.
Theory of Mind in Large Language Models: Assessment and Enhancement: This paper presents a systematic survey of evaluation benchmarks (10+ story-based benchmarks) and enhancement strategies (prompt-only and fine-tuning methods) for the Theory of Mind (ToM) capabilities of LLMs, highlighting that current LLMs still fall significantly short in ToM reasoning and outlining future directions.
TigerLLM - A Family of Bangla Large Language Models: Addressing the severe shortage of LLMs for Bangla (the 5th most spoken language globally), this work develops a high-quality textbook corpus Bangla-TextBook (10M tokens) and native instruction dataset Bangla-Instruct (100K). The trained TigerLLM family surpasses all open-source alternatives and outperforms GPT-3.5 across six benchmarks.
To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization: This work proposes AutoCode, an EM-framework-based method that enables mathematical LLMs to autonomously decide when to use code tools to assist reasoning. By guiding the exploration of high-potential code-triggering decisions in the E-step and optimizing via offline RL in the M-step, a 7B model achieves an improvement of over 11% on MATH500.
The Impact of Token Granularity on the Predictive Power of Language Model Surprisal: This paper systematically investigates the impact of subword token granularity (vocabulary sizes from 256 to 128K) on the ability of LM surprisal to predict human reading times. It finds that a moderate granularity of ~8K vocabulary performs best for predicting natural reading times (even outperforming GPT-2), while coarser-grained tokens are more sensitive to garden-path syntactic effects, revealing that the optimal tokenization granularity for cognitive modeling does not align with general NLP standards.
Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs: Proposes the Token Prepending (TP) technique, which prepends the sentence embedding decoded from each layer to the beginning of the sentence. This allows early tokens under causal attention to perceive the complete sentence information, significantly improving the quality of LLM sentence embeddings without requiring any training.
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling: Proposes Token Recycling, a training-free speculative decoding method that stores candidate tokens rejected during decoding in a lightweight adjacency matrix, constructs a draft tree via a BFS algorithm, and verifies it with tree attention. It achieves approximately a $2\times$ speedup across all LLM scales with less than 2MB of storage overhead.
ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models: This paper proposes the ToolCoder framework, which reformulates tool learning as a code generation task. By drawing on software engineering principles (requirements analysis → modular design → implementation & execution → error debugging → code reuse), it enables LLMs to perform multi-step tool calls by generating and executing Python code, comprehensively outperforming baselines such as ReAct and CodeAct on RestBench and API-Bank.
ToolSpectrum: Towards Personalized Tool Utilization for Large Language Models: This paper proposes the ToolSpectrum benchmark to define and evaluate the personalized tool utilization capabilities of LLMs for the first time—selecting the most appropriate tools based on user profiles and environmental factors. Experiments demonstrate that personalization significantly improves user experience, but existing LLMs exhibit limited capability in jointly reasoning over both user and environmental factors.
Towards Enhanced Immersion and Agency for LLM-based Interactive Drama: This work proposes an Immersion-Agency paradigm to conceptualize LLM-based interactive drama, and designs two methods—Playwriting-guided Generation and Plot-based Reflection—to enhance story generation quality and player agency, respectively.
Towards Harmonized Uncertainty Estimation for Large Language Models: Proposes the CUE framework, which calibrates existing uncertainty estimation scores by training a lightweight classifier (Corrector) aligned with the target LLM's performance. It achieves harmonized improvements across three dimensions—indicativeness, precision-recall balance, and calibration—with performance gains of up to 60%.
Towards Style Alignment in Cross-Cultural Translation: This paper defines "style alignment" as a core goal of cross-cultural translation for the first time, systematically revealing style neutralization bias and English-centric bias in LLM translation. It proposes the RASTA method, which learns cultural alignment mappings in the embedding space to retrieve style-matched few-shot examples, improving style alignment by up to 56% without degrading translation quality.
Training-free LLM Merging for Multi-task Learning: This paper proposes Hi-Merging, a hierarchical iterative training-free model merging method. By utilizing model-wise and layer-wise pruning and scaling operations combined with contribution analysis, it identifies and resolves parameter conflicts. This merges specialized LLMs across different tasks/languages into a single unified multi-task model, outperforming mixed-data fine-tuning baselines in most scenarios.
Training Language Model to Critique for Better Refinement: This paper proposes Refinement-oriented Critique Optimization (RCO), which trains the critic model using "Critique Utility" (CU), the ratio of refinement improvement driven by the critique, as the reward signal. The critic is optimized with an MSE objective derived from a DPO variant, so critique quality never has to be evaluated directly. Across five tasks (dialogue generation, summarization, QA, mathematical reasoning, and code generation), RCO-trained 7B/13B critic models significantly outperform 70B baseline models and the DPCO method on CU and RQS metrics.
Transforming Podcast Preview Generation: From Expert Models to LLM-Based Systems: Spotify proposes using an LLM (Gemini 1.5 Pro) to replace the legacy multi-model feature engineering pipeline for generating podcast preview clips. This approach significantly outperforms the traditional system in both offline human evaluation and online A/B testing, achieving a 4.6% increase in user engagement duration and a 5x improvement in processing efficiency.
TRATES: Trait-Specific Rubric-Assisted Cross-Prompt Essay Scoring: This work proposes the TRATES framework, redefining the role of LLMs in automated essay scoring (AES) from direct scorers to trait-specific feature generators and extractors. TRATES leverages LLMs to automatically convert grading rubrics into assessment questions (sub-traits), then trains a regression model on the resulting features combined with general writing quality features and prompt-specific features. TRATES achieves SOTA across all 8 traits on the ASAP dataset and establishes the first cross-prompt trait scoring baseline on the ELLIPSE dataset.
TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues: The TReMu framework is proposed. By employing time-aware memorization (timeline summarization) and neuro-symbolic temporal reasoning (LLM-generated Python code execution for temporal calculations), it improves the accuracy of GPT-4o on a multi-session dialogue temporal reasoning benchmark from 29.83% to 77.67%.
TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs: This paper proposes TrimLLM. Building on the layer-wise specialization phenomenon, it progressively drops layers that are unimportant to the target domain during domain-specific fine-tuning. It achieves a 2.1-5.7x inference speedup without accuracy loss at a 50-60% compression rate, and does not require specialized hardware.
UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions: This paper proposes UAQFact (13,970 questions), a bilingual dataset of unanswerable questions (UAQs), where each question is annotated with factual knowledge from knowledge graphs. It defines three evaluation tasks to measure the capability of LLMs to distinguish UAQs from answerable questions (ABQs), and to utilize internal/external factual knowledge to handle UAQs. Experiments reveal that LLMs struggle to utilize relevant knowledge effectively even when it is already stored.
Un-considering Contextual Information: Assessing LLMs' Understanding of Indexical Elements: First systematic evaluation of LLMs' understanding of English indexicals (I/you/here/tomorrow) by constructing a 1,600-item 2×2 factorial design evaluation set. The study reveals that LLMs heavily rely on irrelevant contextual information rather than grammatical rules for "you/here/tomorrow", and quotation marks have completely opposite effects on different indexicals.
Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models?: This paper systematically investigates the impact of increasing the number of examples on the predictive uncertainty of LLMs in long-context ICL. Through uncertainty decomposition, it reveals that performance gains primarily stem from the reduction of epistemic uncertainty ($EU$), and explains the internal mechanism of uncertainty reduction from the perspective of residual stream projection.
Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems: Through semi-structured interviews with 12 practitioners responsible for evaluating representational harms in LLM-based systems, this work reveals that publicly available measurement tools generally fail to meet practitioner needs: they are either "not useful" due to insufficient validity/specificity or "not used" due to organizational/institutional barriers. Drawing on measurement theory and pragmatic measurement frameworks, the work proposes systematic recommendations for improvement.
Understanding Silent Data Corruption in LLM Training: This paper presents the first systematic study on the impact of real-world Silent Data Corruption (SDC) on LLM training. By pairing unhealthy nodes with healthy ones and introducing synchronization mechanisms, the authors reveal SDC characteristics and impact patterns across three levels: submodule computation, single-step gradients, and cumulative training.
Understanding the Dark Side of LLMs' Intrinsic Self-Correction: This paper systematically investigates the failure of LLMs' intrinsic self-correction. It proposes three interpretability methods to reveal the underlying reasons: answer wavering and prompt bias in simple tasks, and human-like cognitive biases in complex tasks. Furthermore, two simple yet effective mitigation strategies—question repeating and few-shot SFT—are proposed.
Understanding the Repeat Curse in Large Language Models from a Feature Perspective: Investigates the repeat curse in LLMs from the perspective of mechanistic interpretability. Specifically, Sparse Autoencoders (SAEs) are used to extract monosemantic features to locate "repeat features" in the middle and final layers. Activating these features induces repetition, whereas turning them off mitigates repetition without compromising model performance.
Uni-Retrieval: A Multi-Style Retrieval Framework for STEM's Education: This paper proposes a multi-style multimodal retrieval task and dataset SER (24,000+ query pairs) designed for STEM education scenarios, alongside a lightweight retrieval model Uni-Retrieval based on a Prompt Bank. It extracts query style features through prototype learning and dynamically selects prompt vectors to enhance retrieval performance across various styles (text, sketch, art, low-resolution, speech), surpassing existing methods on both STEM education retrieval and traditional retrieval datasets.
Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights: This study is the first to systematically demonstrate that LLMs aligned with Schwartz values carry unintended safety risks—specifically, certain value dimensions are significantly correlated with distinct safety risk categories. Drawing from psychological perspectives, this paper explains the origins of these associations and proposes a mitigation strategy that suppresses the relevant values via prompting to effectively reduce harmful behaviors.
ScaleQuest: Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch: Proposes ScaleQuest, which transforms a 7B problem-solving model into a question-generation model via a two-stage training of Question Fine-Tuning (QFT) + Question Preference Optimization (QPO). It synthesizes 1 million high-quality math question-answer pairs from scratch, comprehensively outperforming all open-source datasets across four benchmarks, with performance continuing to rise and showing no saturation as data scales to 1M.
Unlocking Recursive Thinking of LLMs: Alignment via Refinement: This work proposes AvR (Alignment via Refinement), a two-stage framework that leverages refinement-aware rewards and differential learning to equip LLMs with "critique $\rightarrow$ refinement" recursive thinking capabilities. Using only 10k data points, it improves the win rate of LLaMA-3-8B-Instruct on AlpacaEval 2 by over 26 percentage points.
UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs' Memorization: This work proposes UnSeenTimeQA, a time-sensitive question-answering benchmark based on synthetic facts (rather than real-world events) that eliminates the risk of data contamination by avoiding web-searchable queries. It designs three categories of temporal reasoning questions to evaluate the true temporal reasoning capabilities of LLMs, revealing that LLMs perform poorly on long-range event dependencies and parallel event reasoning.
Unveiling Dual Quality in Product Reviews: An NLP-Based Approach: This paper proposes an automated detection task for "dual quality" (DQ) in product reviews, constructing the first Polish DQ dataset (1,957 reviews) through an iterative active learning strategy. It systematically compares three classes of approaches—SetFit, Transformer encoders, and LLMs—finding that language-specific encoders perform comparably to LLMs with instructions (DQ F1 ≈ 80-83%), and validates cross-lingual transfer capabilities.
Value Portrait: Assessing Language Models' Values through Psychometrically and Ecologically Valid Items: This paper proposes "Value Portrait," a benchmark for assessing the value orientation of 44 LLMs. Grounded in psychometric validation (correlating each test item with actual human value scores) and ecological validity (using real-world user-LLM interaction scenarios), the benchmark reveals that LLMs generally prioritize benevolence, security, and self-direction while exposing cognitive value biases toward different demographic groups.
Veracity Bias and Beyond: Uncovering LLMs' Hidden Beliefs in Problem-Solving Reasoning: This paper reveals that LLMs exhibit a "Veracity Bias" in reasoning tasks. Despite explicit alignment against stereotypes, LLMs systematically attribute correct answers to specific ethnic groups (attribution bias) and evaluate the same solution differently depending on the "author's" race (evaluation bias). This bias is prevalent across mathematics, coding, commonsense reasoning, and writing tasks.
Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System: Proposes the VirSci multi-agent system, which constructs a virtual research ecosystem from real scientist data and generates scientific ideas through a 5-step collaborative workflow with an inter- and intra-team discussion mechanism, significantly outperforming single-agent systems in novelty and potential impact.
WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models: Proposes WarriorCoder, which builds an arena among multiple expert code LLMs: an attacker challenges a defender with instructions from its own domain of expertise, a judge evaluates the responses, and the target model is trained on the winning answers. This generates high-quality and highly diverse code training data from scratch without relying on proprietary models or pre-existing datasets, achieving state-of-the-art (SOTA) performance.
What Happened in LLM Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective: This paper systematically investigates the behavioral differences of LLM layers when trained on fast thinking (no/short CoT) vs. slow thinking (detailed CoT) data from a gradient perspective. It reveals that slow thinking training leads to more uniform and stable gradients across layers, whereas fast thinking leads to larger gradients and more severe layer-wise fluctuations. Furthermore, the gradient patterns of slow thinking can distinguish correct from irrelevant reasoning paths.
When Large Language Models Meet Speech: A Survey on Integration Approaches: A systematic survey of integration approaches between speech and Large Language Models (LLMs), categorizing existing works into three major paradigms: text-based, latent-representation-based, and audio-token-based integrations. It covers application scenarios including ASR, S2TT, S2ST, and TTS, and provides comparisons of the strengths/weaknesses of each approach alongside future challenges.
When to Speak, When to Abstain: Contrastive Decoding with Abstention: Proposes CDA (Contrastive Decoding with Abstention), a training-free decoding method. By using entropy-calibrated uncertainty estimation, CDA enables LLMs to generate correct answers when parametric/contextual knowledge is available, and to actively abstain when both are unreliable, covering all four knowledge availability scenarios.
Which Demographics Do LLMs Default to During Annotation?: This study compares the annotation behavior of LLMs under three prompt conditions: no demographic information (N), socio-demographic information (SD), and placebo information (P). It reveals that in subjective annotation tasks (offensiveness/politeness), LLMs default to annotation patterns that align more closely with white, young, and highly educated cohorts. Furthermore, socio-demographic prompting exerts a more systematic influence than placebo prompting.
Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph: This work re-interprets and traces the origin of a known but forgotten pre-LLM era conclusion—multi-layer autoregressive Transformer language models do not require explicit positional encodings to distinguish permuted sequences, because cascaded (permutation-invariant) set processors collectively exhibit full position-sensitivity under causal masking; it also reflects on the knowledge gap and citation bias of the LLM era.
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement: This paper identifies a "discriminative-generative safety gap" where LLMs as discriminators can accurately identify jailbreak requests but still produce toxic content when acting as generators. It proposes a training-free strategy called SAGE (Self-Aware Guard Enhancement), which bridges the model's safety discriminative capability to its generative behavior via a discriminative analysis module and a discriminative response module, achieving an average defense success rate of 99% across six models.
Why Prompt Design Matters and Works: A Complexity Analysis of Prompt Search Space in LLMs: Analyzes the mechanism of prompts in LLM reasoning from a theoretical perspective—proving that prompts act as "selectors" to extract task-relevant information from hidden states and define trajectories within the answer space. It analyzes the complexity of the optimal prompt search space and experimentally demonstrates that optimal prompt search can lead to a 50%+ improvement in reasoning performance.
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region: Reveals a ubiquitous phenomenon in safety-aligned LLMs, termed TASA, in which safety mechanisms are excessively anchored in the chat template region. Consequently, jailbreak attacks can bypass safety guardrails by interfering with information processing within this template region. As a defense, the paper proposes migrating safety probes from the template region to the generation phase.
X-Turing: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents: This paper proposes the X-Turing framework, which enhances and streamlines the Turing Test by introducing a burst dialogue mode and pseudo-dialogue generation technology. It evaluates the human-mimicking capabilities of LLMs in long-term dialogues, revealing a significant performance drop as the number of dialogue turns increases.
Zero-Shot Belief: A Hard Problem for LLMs: This paper proposes two zero-shot frameworks, Unified and Hybrid, for source-and-target belief prediction. The hybrid approach utilizes a fine-tuned DeBERTa for event detection combined with an LLM for belief annotation, setting a new SOTA with 72.0% Full F1 on FactBank. Additionally, it highlights nested belief performance for the first time (reporting a low Nested F1 of only 25.3%), revealing that this sub-task remains a significant challenge for all current LLMs.
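As an illustration of the Token Recycling entry above (candidate tokens stored in an adjacency structure, then a draft tree built by BFS), here is a minimal Python sketch. It is not the authors' implementation: the dictionary-of-lists storage, the overwrite-on-update rule, and the names `update_candidates`, `build_draft_tree`, `max_depth`, and `max_nodes` are all assumptions made for illustration, and the tree-attention verification step is omitted.

```python
from collections import deque


def update_candidates(matrix, token, top_k_next):
    """Record candidate successors for `token` (assumed rule: overwrite old ones)."""
    matrix[token] = list(top_k_next)


def build_draft_tree(matrix, root, max_depth, max_nodes):
    """Expand a draft tree from `root` by breadth-first search.

    Returns a list of (node_id, parent_id, token) tuples in BFS order;
    the root has parent_id -1. Expansion stops at `max_depth` levels
    below the root or once `max_nodes` nodes exist.
    """
    nodes = [(0, -1, root)]
    queue = deque([(0, root, 0)])  # (node_id, token, depth)
    while queue and len(nodes) < max_nodes:
        node_id, token, depth = queue.popleft()
        if depth == max_depth:
            continue  # leaf level reached, do not expand further
        for cand in matrix.get(token, []):
            if len(nodes) >= max_nodes:
                break
            child_id = len(nodes)
            nodes.append((child_id, node_id, cand))
            queue.append((child_id, cand, depth + 1))
    return nodes


# Hypothetical token ids, purely for demonstration.
matrix = {}
update_candidates(matrix, 1, [2, 3])
update_candidates(matrix, 2, [4])
update_candidates(matrix, 3, [5, 6])
draft = build_draft_tree(matrix, root=1, max_depth=2, max_nodes=10)
```

In a real decoder, each root-to-leaf path in `draft` would be a candidate continuation verified in a single forward pass, and the top-k predictions from that pass would be fed back through `update_candidates`.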