📂 Others

💬 ACL2025 · 184 paper notes

📌 Same area in other venues: 📷 CVPR2026 (105) · 🔬 ICLR2026 (116) · 💬 ACL2026 (4) · 🧪 ICML2026 (70) · 🤖 AAAI2026 (117) · 🧠 NeurIPS2025 (121)

🔥 Top topics: Dialogue ×9 · Reasoning ×8 · Alignment/RLHF ×7 · Few-/Zero-Shot Learning ×4 · Agents ×4

Barec: A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment: This work constructs Barec—the first large-scale, balanced, and fine-grained Arabic readability assessment corpus containing over 69K sentences, 1M words, and 19 grading levels, annotated by 6 professional educators. It benchmarks 4 Arabic BERT models × 4 input variants × 5 loss functions, revealing that the morphological tokenization input D3Tok combined with regression loss achieves a QWK of 84.0%.
A Little Human Data Goes A Long Way: Large-scale experiments on 8 fact verification and QA datasets show that mixing a very small amount of human-annotated data (as few as 125 samples) into synthetic data significantly improves model performance. Replacing the final 10% of human data with synthetic data leads to severe performance degradation, and matching the gain from just 200 human samples requires orders of magnitude more synthetic data.
A Measure of the System Dependence of Automated Metrics: Identifies an overlooked "system dependence" issue in automated machine translation evaluation metrics: the same metric score can correspond to different human ratings for different translation systems. The paper proposes the SysDep measure to quantify this effect and shows that even the best WMT23 metric, XCOMET, exhibits severe system dependence that leads to incorrect rankings.
A Multi-Persona Framework for Argument Quality Assessment: This paper proposes the MPAQ framework, which simulates multiple distinct evaluator perspectives (personas) using Large Language Models to conduct multi-aspect quality assessment of arguments. It designs a coarse-to-fine scoring strategy (first integer, then decimal), significantly outperforming existing baselines on the IBM-Rank-30k and IBM-ArgQ-5.3k datasets while providing interpretable multi-perspective explanations.
A New Formulation of Zipf's Meaning-Frequency Law through Contextual Diversity: This paper proposes to reformulate Zipf's meaning-frequency law as a power-law relationship between word frequency and contextual diversity. It quantifies the number of word meanings through the directional distribution of contextualized word vectors generated by language models. The findings reveal that this law is unobservable in small-scale language models, and autoregressive LMs require significantly more parameters than masked LMs to exhibit the law.
A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs: A directed acyclic graph (DAG) based workflow framework is proposed. By decomposing the complex business constraints of an LLM agent into different state nodes in the graph and combining this with a response masking fine-tuning strategy, a production-grade e-commerce conversational agent is built. It significantly outperforms the GPT-4o baseline in both task accuracy and format adherence.
A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior: This paper proposes a unified probabilistic model of reading behavior based on marked spatio-temporal point processes. It simultaneously models when and where fixations occur and how long they last, avoiding the information loss associated with traditional aggregated measures, and reveals that surprisal has an extremely limited contribution to predicting fine-grained eye movements.
ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Clause Retrieval: Builds the first expert-annotated clause retrieval benchmark for contract drafting, ACORD (114 queries, 126K+ pairs, 1-5 star ratings). Evaluating 20 retrieval methods reveals that BM25 + GPT-4o pointwise reranking performs best (NDCG@5 = 76.9%), but the accuracy for high-quality clauses is extremely low (5-star precision@5 is only 17.2%), highlighting a significant gap between models and the actual needs of lawyers.
Adaptive Feature-based Low Rank Plus Sparse Decomposition for Subspace Clustering: This paper proposes an adaptive feature-driven low-rank plus sparse matrix decomposition method. By adaptively learning the weights of low-rank and sparse components in the feature space, it addresses the issues of noise robustness and insufficient feature discriminability in subspace clustering.
Adaptive Retrieval without Self-Knowledge? Bringing Uncertainty Back Home: This work conducts a comprehensive evaluation of 35 adaptive retrieval methods (including 8 state-of-the-art methods and 27 uncertainty estimation methods), revealing that classic uncertainty estimation techniques often outperform complex, specialized pipelines in terms of efficiency and self-knowledge capability, while maintaining comparable QA performance.
Advancing Sequential Numerical Prediction in Autoregressive Models: Proposes Numerical Token Integrity Loss (NTIL)—a two-level numerical prediction loss function. At the token level, it replaces cross-entropy with exponentially position-weighted Earth Mover's Distance (EMD) to maintain numerical order. At the sequence level, it penalizes overall numerical deviation through differentiable numerical construction. This approach significantly improves the numerical prediction accuracy of autoregressive models across tasks such as object detection, text detection, mathematical reasoning, and clock recognition.
AIDE: Attribute-Guided Multi-Hop Data Expansion for Data Scarcity in Task-Specific Fine-tuning: This paper proposes the AIDE framework, which generates around 3K high-quality task-specific training data points from only 10 seed samples through a multi-hop data expansion mechanism of "attribute guidance + Persona enhancement + residual connections." Fine-tuning Mistral-7B on this data outperforms human-annotated data fine-tuning by an average of 6% and SOTA methods like Evol-Instruct by 30% under zero-shot settings.
ALGEN: Few-Shot Inversion Attacks on Textual Embeddings via Cross-Model Alignment: This paper proposes ALGEN, a few-shot textual embedding inversion attack method. By linearly aligning the victim's embedding space with the attacker's embedding space and then using a trained embedding-to-text generator to reconstruct the original text, a partially successful attack can be launched with only 1 leaked sample, achieving a Rouge-L of 45.75 with 1,000 samples.
An Analysis of Datasets, Metrics and Models in Keyphrase Generation: Conducts a systematic analysis of 50+ papers in the keyphrase generation field, revealing critical issues such as high similarity among benchmark datasets and inconsistent evaluation metric computations leading to overestimated performance, and releases a strong PLM-based model to facilitate future research.
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs: By training GPT-2 small on 12 languages, this study systematically tests whether language models (LMs) can distinguish possible languages (natural languages) from impossible ones (scrambled word orders, etc.). The findings reveal that LMs exhibit partial human-like learning biases but are not perfect—they can differentiate within a single language but fail to achieve complete separation cross-linguistically. Furthermore, in noun phrase (NP) word order experiments, generalization testing (rather than perplexity) is found to reflect typological preferences.
Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?: This paper proposes the ACON dataset and three consistency evaluation criteria (cyclic consistency, forward equivariance, and conjugated equivariance), finding that current any-to-any models are not more cross-modally consistent in point-wise evaluations than combinations of specialist models, though weak consistency can be observed through distributional analyses of multiple editing operations.
Attention Entropy is a Key Factor for Parallel Context Encoding: This paper discovers that parallel context encoding leads to an abnormal increase in the attention entropy of query tokens, which is a key factor for performance degradation. Two training-free methods, Shared Attention Sink and Selective Attention, are proposed to effectively mitigate this issue.
Autalic: A Dataset for Anti-Autistic Ableist Language In Context: This paper proposes Autalic, the first dataset dedicated to detecting anti-autistic ableist language in context. It contains 2,400 Reddit sentences annotated with context by experts from neurodivergent backgrounds. Experiments reveal that current LLMs (including DeepSeek, Llama3, Gemma2, and Mistral) exhibit severe disagreement with human judgment when identifying anti-autistic ableism (with an average Cohen's Kappa of only 0.091), highlighting the difficulty of this task.
AutoMixer: Checkpoint Artifacts as Automatic Data Mixers: Proposes the AutoMixer framework, which uses checkpoint models saved during training as "data mixers" to regroup and reweight training data by aggregating first-order influence function approximations across multiple checkpoints, achieving a performance improvement of up to 1.93% across eight reasoning benchmarks.
Battling against Tough Resister: Strategy Planning with Adversarial Game for Non-collaborative Dialogues: This paper proposes a strategy planning framework based on adversarial games to address strategy selection when facing a tough opponent in non-collaborative dialogues (e.g., persuasion, negotiation), generating more effective persuasive strategies by modeling the adversarial dynamics between both parties.
Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey: This opinionated survey distinguishes between behavioural systematicity (whether a model can generalize correctly to new combinations) and representational systematicity (whether internal representations are structurally compositional). Using Hadley's three-level classification (weak, quasi, and strong), the authors review mainstream benchmarks in both the language and vision domains, revealing that most existing benchmarks only test weak or quasi-systematicity, and call for bridging the gap between behavioural and representational evaluation through mechanistic interpretability methods.
Beyond Position: the emergence of wavelet-like properties in Transformers: Through frequency analysis and wavelet decomposition, this work reveals that attention heads in Transformer models using RoPE positional encodings spontaneously develop wavelet-like multi-resolution processing properties to compensate for the inherent trade-off between positional precision and frequency resolution in RoPE.
Bregman Conditional Random Fields: Sequence Labeling with Parallelizable Inference: Proposes Bregman CRF (Bcrf), a novel discriminative model for sequence labeling based on mean regularization. It implements a parallelizable inference algorithm that uses iterative Bregman projections in place of the inherently sequential Viterbi/Forward algorithms of traditional CRFs. Bcrf achieves performance on par with standard CRFs on POS, NER, and word segmentation tasks while being faster, and outperforms Mean Field methods in scenarios with forbidden label transition constraints.
CADReview: Automatically Reviewing CAD Programs with Error Detection and Correction: Proposes the CAD program review task and the ReCAD framework, which automatically detects errors in CAD programs and generates correction feedback based on reference images. The work also constructs the CADReview dataset, which contains 20K+ samples covering 8 error classes.
Can Uniform Meaning Representation Help GPT-4 Translate from Indigenous Languages?: This paper explores incorporating Uniform Meaning Representation (UMR) semantic graphs into GPT-4 prompts for translating three indigenous languages (Navajo, Arapaho, and Kukama), finding that the addition of UMR leads to statistically significant performance gains in most cases.
Capacity Matters: A Proof-of-Concept for Transformer Memorization on Real-World Data: Using the SNOMED clinical knowledge graph as the data source, this paper systematically investigates the memorization capacity of decoder-only Transformers on structured data. It reveals that the embedding dimension is the primary factor determining learning speed and capacity, while increasing model depth yields marginal returns, and the Softmax activation function exhibits the most stable performance.
Causal Estimation of Tokenisation Bias: This paper formalizes the impact of tokenizer choice on language model outputs as "tokenisation bias" for the first time. It utilizes Regression Discontinuity Design (RDD) from causal inference to quantify this effect, revealing that when a subword is included in the vocabulary, the probability of its corresponding string can increase by up to 17 times (for smaller models). This indicates that tokenization is an underestimated, critical design choice in language modeling.
Cautious Next Token Prediction: Proposes Cautious Next Token Prediction (CNTP), a training-free adaptive decoding strategy. When the model's prediction entropy is high (showing uncertainty), it samples multiple candidate paths up to a punctuation mark, then selects the path with the lowest perplexity as the final generation. This significantly improves accuracy without sacrificing diversity.
ChuLo: Chunk-Level Key Information Representation for Long Document Understanding: The core of ChuLo is not simply partitioning long documents into smaller chunks, but rather identifying the most critical semantic phrases globally first, and then re-injecting this key information into each chunk's representation. This preserves both global semantics and fine-grained token information while using compact chunk representations.
CiteEval: Principle-Driven Citation Evaluation for Source Attribution: This paper proposes CiteEval, a principle-driven framework for citation evaluation. By considering the entire retrieval context, multiple contexts beyond retrieval, and fine-grained evaluation criteria, the authors construct the CiteBench benchmark and the CiteEval-Auto automatic metric, which significantly outperform existing NLI-based methods in citation quality evaluation.
CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification: This paper explores three progressive model architectures for the SemEval-25 Task 6 (PromiseEval) on corporate ESG report promise verification: an ESG-BERT baseline, a linguistic feature-enhanced version, and a joint sub-task model integrating attention pooling and multi-objective learning. The proposed approach slightly outperforms the official baseline on the private leaderboard (0.5268 vs 0.5227), validating the effectiveness of linguistic feature engineering and multi-task learning in ESG promise verification.
CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model: CoachMe is proposed to automatically generate sports-specific coaching instruction texts by comparing the differences (across both temporal and physical dimensions) between the learner's motion and a reference motion, outperforming GPT-4o by 31.6% in figure skating and 58.3% in boxing (according to G-Eval).
CoAM: Corpus of All-Type Multiword Expressions: Constructs CoAM (1.3K sentences), a high-quality, all-type Multiword Expression (MWE) identification dataset. Through a multi-step quality assurance pipeline, this work addresses the annotation inconsistency issues in existing datasets. It also demonstrates that fine-tuned Large Language Models (LLMs) significantly outperform the previous SOTA method, MWEasWSD, on the MWE identification task.
Code-Switching and Syntax: A Large-Scale Experiment: Through large-scale, multilingual, and cross-phenomenon experiments, this study systematically validates the linguistic consensus that "syntactic information is sufficient to explain code-switching (CS) patterns" for the first time. Using only syntactic features, the model achieves judgment accuracy comparable to bilingual humans, and the learned syntactic patterns generalize to unseen language pairs.
Completing A Systematic Review in Hours instead of Months with Interactive AI Agents: This paper proposes InsightAgent, a human-centric interactive multi-agent system that reduces the drafting time of medical systematic reviews from months to approximately 1.5 hours through semantic clustering partitioning, multi-agent parallel reading, and real-time user interaction, achieving 79.7% of human drafting quality.
CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions: CONFETTI proposes a function-calling evaluation benchmark for multi-turn conversational scenarios, containing 109 human-simulated conversations, 313 user turns, and 86 APIs. Through off-policy turn-level evaluation and dialog act annotation, it systematically tests the tool-calling capability of LLMs in complex conversational scenarios. The study reveals that even the strongest model (Nova Pro) only achieves around 40% accuracy, with chained calling being a universal weakness.
Consistent Client Simulation for Motivational Interviewing-based Counseling: This paper proposes a consistent client simulation framework for motivational interviewing (MI) psychological counseling. Through four modules—state transition, action selection, information selection, and response generation—it ensures that the behavior of simulated clients aligns with their predefined profiles (motivations, beliefs, change plans, receptivity), outperforming baseline methods in both automatic and expert evaluations.
Contextual Experience Replay for Self-Improvement of Language Agents: CER (Contextual Experience Replay) is a training-free self-improvement framework for language agents. By accumulating past interaction experiences and synthesizing them into a dynamic memory buffer, it allows the agent to retrieve relevant knowledge during inference to enhance decision-making on new tasks, achieving a 51.0% relative success rate improvement over the GPT-4o baseline on WebArena.
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter: CORAL improves the feature consistency of the draft model in multi-step training via Cross-Step Representation Alignment (CSRA), and compresses the inference latency of the large-vocabulary LM head via a weight grouping mechanism, achieving a \(2.50\text{--}4.07\times\) speedup on LLaMA3/Qwen2.5, outperforming EAGLE-2 and HASS.
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity: By compressing text into trainable [mem] vectors via per-sample optimization, this paper finds that Llama-3.1-8B can losslessly compress 1568 tokens into a single input vector. This reveals a gap of two orders of magnitude between existing methods (approx. ×10 compression ratio) and the practically achievable limit (×1500+).
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation: This paper treats the attention scores of Transformers as feature maps. By applying convolutional operations on these attention scores (instead of relying solely on simple key-query dot products), it significantly enhances the length extrapolation capability of Transformers on long sequences, transforming the extrapolation problem into a classic image feature processing problem.
Decoding Reading Goals from Eye Movements: This paper introduces the novel task of decoding readers' reading goals (information seeking vs. ordinary reading) from eye movement trajectories. Through a systematic comparison of 12 models, a Transformer-based scanpath and language modeling approach (RoBERTa-Eye-F) is found to be optimal, achieving high-accuracy, real-time prediction early in the reading process.
DeepRTL2: A Versatile Model for RTL-Related Tasks: DeepRTL2 is the first LLM to unify the processing of both RTL (Register-Transfer Level) generation and embedding tasks. Through a meticulously constructed dataset and the GRIT training strategy, it achieves SOTA performance across five major tasks: code generation, code understanding, natural language code search, functional equivalence checking, and performance prediction.
Detecting Sockpuppetry on Wikipedia Using Meta-Learning: This paper applies meta-learning to the malicious sockpuppet detection task on Wikipedia. By optimizing the model's rapid adaptation capability through training across multiple sockpuppet groups, it significantly improves detection accuracy in data-scarce scenarios and releases a new sockpuppet investigation dataset.
Developmentally-plausible Working Memory Shapes a Critical Period for Language Acquisition: Inspired by the "Less-is-More" hypothesis, this study proposes the DynamicLimit-Exp method, which integrates the exponential growth characteristics of human working memory during the critical period into language model training (by dynamically adjusting ALiBi slopes). GPT-2 models trained on Child-Directed Speech data using this method significantly outperform baselines without memory constraints and those with static constraints in syntactic evaluation.
Digital Gatekeepers: Google's Role in Curating Hashtags and Subreddits: By comparing hashtag and subreddit search results returned by Google with non-sampled ground-truth data from Reddit and Twitter/X, this paper reveals that Google's algorithms systematically suppress content related to pornography, conspiracy theories, advertising, and cryptocurrency while promoting highly engaging content, thereby acting as a "digital gatekeeper" that shapes public discourse.
Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction: This paper proposes a three-step pipeline that predicts student choice preferences via a pairwise ranker and subsequently trains a distractor generator using DPO, rendering the generated multiple-choice question (MCQ) distractors more plausible and discriminative.
Do not Abstain! Identify and Solve the Uncertainty: This paper proposes the ConfuseBench benchmark and a method for identifying uncertainty sources based on the uniqueness of the inquiry answer. It introduces InteractDPO to dynamically generate preference pairs during policy training to enhance inquiry quality, enabling LLMs to proactively identify and resolve uncertainty rather than simply abstaining.
Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback: Proposes Dolphin, a closed-loop auto-research framework that incorporates a three-stage cycle of "idea generation \(\rightarrow\) experimental verification \(\rightarrow\) results feedback". Through task-attribute-guided paper ranking and exception-traceback-guided debugging processes, Dolphin automatically proposes and verifies methods that approach human-designed SOTA on tasks such as 3D classification.
DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing: This paper releases DREsS, a large-scale standardized rubric-based essay scoring dataset with three sub-datasets: DREsS_New (1.7K samples of student classroom data), DREsS_Std (6.5K samples from standardized historical datasets), and DREsS_CASE (40.1K augmented samples). It also proposes a corruption-based essay augmentation strategy (CASE) that improves the BERT baseline QWK score from 0.471 to 0.685 (a relative gain of 45.44%).
DRS: Deep Question Reformulation With Structured Output: Proposes DRS (Deep Question Reformulation with Structured Output), a zero-shot method that improves the question reformulation accuracy of GPT-3.5 from 23.03% to 70.42% through entity-driven DFS search and structured output constraints. This enables LLMs to effectively help users convert unanswerable questions into answerable counterparts.
Enhancing Transformers for Generalizable First-Order Logical Entailment: A systematic study on the generalizable reasoning capability of Transformers in first-order logical entailment tasks, revealing the impacts of query syntax, token embeddings, and Transformer architectures (especially position embeddings), and proposing TEGA (Transformer Encoder with Guided Attention) to significantly improve logical reasoning performance under relative position encoding settings.
Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments: This paper proposes a confidence modeling approach based on Kernel-Weighted Ordinal Classification Cross-Entropy (KWOCCE). By leveraging the ordinal structure of CEFR levels and a score binning strategy, the method allows up to 47% of scores to be released at 100% CEFR consistency and 99% of scores to be released at \(\ge 95\%\) consistency, a significant improvement over the approximately 92% consistency obtained without confidence filtering.
Entailed Between the Lines: Incorporating Implication into NLI: This work formalizes the task of "implied entailment," expanding the traditional three-way classification of NLI into a four-way classification (implied entailment/explicit entailment/neutral/contradiction). It constructs the INLI dataset, which comprises 10K premises and 40K hypotheses. Experiments demonstrate that fine-tuned models can effectively identify implied entailment and generalize across domains.
Entropy-UID: A Method for Optimizing Information Density: Proposes the Entropy-UID method, which jointly minimizes a weighted combination of entropy and surprisal during the decoding process of autoregressive language models to achieve a uniform distribution of information density. On the WikiText-2, OpenWebText, and WMT datasets, this method achieves the lowest entropy standard deviation (\(\approx 2.8\)) and stable surprisal (\(\approx 5.7\)), outperforming single-objective optimization strategies.
EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding: This paper proposes EpiCoDe, a training-free method combining Model Extrapolation and Contrastive Decoding. It enhances the performance of fine-tuned models in data-scarce scenarios through parameter-space extrapolation and inference-time logit contrast, while providing a theoretical analysis framework from the perspective of logit errors.
ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase Generation: ERU-KG proposes an unsupervised keyphrase generation framework consisting of an informativeness module and a phraseness module. It learns term-level informativeness estimation through reference texts (queries, citation contexts, titles), outperforming all unsupervised baselines and achieving 89% of the performance of supervised models on keyphrase generation benchmarks, while obtaining the fastest inference speed.
Evaluating Design Decisions for Dual Encoder-based Entity Disambiguation: This work systematically evaluates key design choices of Dual Encoders in Entity Disambiguation (ED) tasks (loss functions, similarity measures, label verbalization formats, and negative sampling strategies). Based on the optimal design, the VerbalizED system is constructed, achieving a new SOTA on the ZELDA benchmark. It also explores an iterative prediction strategy to leverage already disambiguated neighboring entities to improve difficult samples.
Evaluating the Evaluation of Diversity in Commonsense Generation: A systematic meta-evaluation of 12 diversity evaluation metrics in Generative Commonsense Reasoning (GCR) tasks reveals that form-based (n-gram) metrics severely overestimate diversity on low-quality generations, while content-based (sentence embedding) metrics align better with human judgments. Consequently, content-level metrics such as VS-Embed or Chamfer Distance are recommended for future GCR research.
Explaining Matters: Leveraging Definitions and Semantic Expansion for Sexism Detection: Addressing the issues of data sparsity and fine-grained classification ambiguity in online sexism detection, this paper proposes two prompt-based data augmentation techniques: Definition-driven Data Augmentation (DDA) and Contextual Semantic Expansion (CSE). DDA leverages category definitions to generate semantically aligned synthetic samples, while CSE enriches training data by analyzing the semantic features of model errors. Combined with a Mistral-7B fallback ensemble strategy, this approach achieves SOTA performance on all tasks of the EDOS dataset.
FastMCTS: A Simple Sampling Strategy for Data Synthesis: FastMCTS proposes a lightweight reasoning data synthesis strategy inspired by MCTS. Through three key improvements (adaptive stay policy, dynamic exploration, and reserve simulation), it generates over 30% more correct reasoning paths than rejection sampling, and models trained on the resulting data improve by an average of 3.9% across multiple mathematics benchmarks.
FRACTAL: Fine-Grained Scoring from Aggregate Text Labels: Proposes the FRACTAL method, which decomposes response-level aggregate labels into sentence-level pseudo-labels. By combining multi-instance learning (MIL) and learning from label proportions (LLP) techniques with prior information (document-sentence cosine similarity), it trains a sentence-level scoring model covering four types of tasks: retrieval, question answering, summarization, and mathematical reasoning.
Frictional Agent Alignment Framework: Slow Down and Don't Break Things: Proposes the Frictional Agent Alignment Framework (FAAF). By employing a two-player (frictive state policy + intervention policy) objective function, FAAF trains LLMs to detect belief conflicts in collaborative dialogues and generate "frictive" interventions that encourage reflection and deliberation, outperforming alignment methods such as DPO, IPO, and PPO.
From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding: This paper proposes the "Attributed Grounding" framework, which constructs the SynthQuestions dataset consisting of 1 million diverse and complex instructions through top-down user attribution and bottom-up web-document-based instruction synthesis, enabling trained models to achieve state-of-the-art performance across multiple general benchmarks.
GA-S3: Comprehensive Social Network Simulation with Group Agents: This paper proposes GA-S3, a social network simulation system based on "Group Agents". It aggregates individuals with similar behaviors into group agents, achieving efficient and accurate simulation of large-scale social networks through hierarchical generation, Markov network reasoning, and behavior modules.
Generating Synthetic Relational Tabular Data via Structural Causal Models: This paper extends the TabPFN approach for synthetic data generation based on Structural Causal Models (SCMs), proposing a framework capable of generating multi-table relational synthetic tabular data by coupling nodes and latent causal relationships to model cross-table dependencies.
GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns: GeNRe is the first French gender-neutral rewriting system. It leverages collective nouns to replace masculine generics, proposing three approaches: a rule-based system, fine-tuned models, and an instruction-based model. Among these, the rule-based system and the Claude 3 Opus + dictionary approach yield the best performance.
GPT-4 as a Homework Tutor can Improve Student Engagement and Learning Outcomes: An 8-week randomized controlled trial (RCT) was conducted in an English as a Second Language (ESL) course at an Italian technical high school, replacing traditional homework with GPT-4 as an interactive tutoring tool. The evaluation found that the GPT-4 group improved in student engagement (significant increases in interest and helper sufficiency) and learning gains under specific conditions (Grade 3 Cohen's \(d = 0.603\)). The system is highly practical, requiring teachers to provide only the homework targets and descriptions. It maintained a hallucination rate of less than 1%, and all participating students expressed a desire to continue using it.
Graph-Structured Trajectory Extraction from Travelogues: This paper proposes the "Visiting Order Graph" to unify the geographic containment hierarchy and chronological transition relations in travel trajectories. It constructs the ATD-VSO benchmark dataset covering 100 Japanese travelogues (consisting of 3,354 geographic entities and 3,369 relations). Baseline experiments reveal that geographic inclusion relation prediction (\(F1=0.355\)) is the primary bottleneck, pointing out a key direction for integrating geographic knowledge in this field.
Graphically Speaking: Unmasking Abuse in Social Media with Conversation Insights: This paper proposes a context-aware abusive language detection framework based on Graph Attention Networks (GAT). It models Reddit conversations as graph structures (nodes = comments, edges = reply relations) and utilizes an affordance-based graph pruning strategy derived from Reddit’s interface rendering logic to preserve key context. A 3-layer GAT model achieves an F1 score of 0.7624, significantly outperforming no-context baselines and flattened context methods, with a particularly pronounced improvement (+4.75%) on context-sensitive samples.
Guidelines for Fine-grained Sentence-level Arabic Readability Annotation: This paper proposes the BAREC corpus and its annotation guidelines, which represent a large-scale Arabic sentence-level readability evaluation resource containing over 69K sentences across 19 readability levels, and establishes benchmark models for automated readability assessment based on this resource.
Hanging in the Balance: Pivotal Moments in Crisis Counseling Conversations: This paper proposes an unsupervised method to detect "pivotal moments" in conversations—points where the next response can dramatically affect the outcome—and validates its effectiveness in crisis counseling scenarios.
Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems: This paper proposes a scalable hard negative mining framework for enterprise domain-specific retrieval. By fusing multiple embedding models, PCA dimensionality reduction, and dual semantic conditional filtering, the framework dynamically selects high-quality hard negatives, achieving significant improvements on both internal cloud service datasets and public benchmarks.
HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference: HATA proposes a method that integrates learning-to-hash technology into the top-k attention mechanism. By mapping queries and keys to binary hash codes to retrieve relative qk score rankings (rather than absolute score estimations), it achieves up to a 7.2x speedup compared to full attention while maintaining model accuracy.
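The idea of ranking keys by binary hash codes, rather than by exact query-key scores, can be sketched in a few lines. This is only an illustration under stated assumptions: HATA learns its hash functions, whereas the sketch below swaps in random hyperplanes (SimHash), and all function names (`make_hyperplanes`, `hash_code`, `top_k_by_hash`) are hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    # Assumption: random hyperplanes stand in for HATA's learned hash functions.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_code(vec, planes):
    # One bit per hyperplane: 1 if the vector lies on its non-negative side.
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | int(dot >= 0)
    return code

def hamming(a, b):
    # Number of differing bits between two integer-encoded hash codes.
    return bin(a ^ b).count("1")

def top_k_by_hash(query, keys, planes, k):
    # Rank keys by Hamming distance to the query code; only the relative
    # order matters, no absolute attention score is estimated.
    q_code = hash_code(query, planes)
    dists = [(hamming(q_code, hash_code(key, planes)), i)
             for i, key in enumerate(keys)]
    return [i for _, i in sorted(dists)[:k]]

# Toy data: key 0 is an exact copy of the query, the rest are random.
rng = random.Random(1)
query = [rng.gauss(0.0, 1.0) for _ in range(16)]
keys = [list(query)] + [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(31)]
planes = make_hyperplanes(dim=16, n_bits=32)
selected = top_k_by_hash(query, keys, planes, k=4)
```

Full attention would then be computed only over the `selected` keys; the copy of the query at index 0 always ranks first because its Hamming distance is zero.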
HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling: NVIDIA releases the HelpSteer3 dataset (annotated by over 7,000 annotators from over 80 countries) to train specialized Feedback and Edit models. During inference, these models establish an "initial response \(\rightarrow\) feedback \(\rightarrow\) edit" loop to enable inference-time scaling for open-domain general tasks. Based on the Llama 3 series 70B model, this method achieves a score of 92.7 on Arena Hard, outperforming OpenAI o1-preview (90.4) and DeepSeek R1 (92.3).
Hierarchical Bracketing Encodings for Dependency Parsing as Tagging: Proposes a family of hierarchical bracketing encodings for the sequence tagging paradigm of dependency parsing, proves that the existing 4-bit encoding is a non-optimal special case of this family, derives an optimal encoding requiring only 12 tags, and extends it to handle arbitrary non-projectivity.
Hierarchical Memory Organization for Wikipedia Generation: Proposes the Memory Organization-based Generation (MOG) framework, which extracts fine-grained memory units (factoids) from web documents and organizes them into a hierarchical Wikipedia outline structure via a recursive clustering-summarization algorithm. This ensures that every section is directly supported by memory. It comprehensively outperforms RAG and STORM baselines in terms of informativeness, citation rate, and verifiability on the FreshWiki and WikiStart datasets.
Counterspeech the Ultimate Shield! Multi-Conditioned Counterspeech Generation through Attributed Prefix Learning: Proposes HiPPrO, a two-stage framework for multi-conditioned counterspeech generation. The first stage optimizes generation across multiple attribute spaces (strategy and emotion) through hierarchical prefix learning, and the second stage enhances constructiveness using reference-free and reward-free preference optimization. Strategy consistency increases by ~38%, and ROUGE metrics improve by 2-3%.
How to Mitigate Overfitting in Weak-to-Strong Generalization?: A two-stage training framework is proposed to address the overfitting issue in weak-to-strong generalization. The first stage enhances the quality of weak supervision signals through uncertainty-based filtering, while the second stage utilizes the fine-tuned strong model to regenerate answers for discarded hard problems to restore problem quality. This approach improves the PGR from 7.19% to 120.50% on GSM8k and MATH.
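For readers unfamiliar with the metric, PGR (performance gap recovered) is usually defined as follows in the weak-to-strong generalization literature. The symbols are assumptions introduced here for illustration, not notation taken from this paper: \(A_{\text{weak}}\) is the weak supervisor's accuracy, \(A_{\text{w2s}}\) is the accuracy of the strong model trained on weak labels, and \(A_{\text{ceil}}\) is the accuracy of the strong model trained on ground-truth labels.

\[
\text{PGR} = \frac{A_{\text{w2s}} - A_{\text{weak}}}{A_{\text{ceil}} - A_{\text{weak}}}
\]

Under this definition, a PGR above 100% means the weakly supervised strong model exceeds the ground-truth-trained ceiling, which is how a figure such as 120.50% can arise.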
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback: This paper proposes HyPER (Hybrid Preference Router), which dynamically decides whether each annotation instance should receive human or AI preference feedback by training a performance prediction model. It achieves a 7-13% improvement on RewardBench compared to pure human or pure AI annotation, while significantly reducing annotation costs.
I0T: Embedding Standardization Method Towards Zero Modality Gap: The I0T framework is proposed to reduce the modality gap to near zero by identifying and eliminating modality-specific features (manifested as peak activations in normalized embeddings) independently learned by the image and text encoders in CLIP. It maintains or even improves downstream task performance and introduces I0T-Score, an automatic evaluation metric that is more interpretable than CLIPScore.
If Attention Serves as a Cognitive Model of Human Memory Retrieval, What is the Plausible Memory Representation?: This paper investigates whether the attention mechanism of Transformer Grammar (TG) can serve as a cognitive model for human memory retrieval. By relating the model to human reading times using Normalized Attention Entropy (NAE), the study reveals that syntax-based attention explains human sentence processing behavior better than token-based attention, and both make independent, complementary contributions.
Predicting Implicit Arguments in Procedural Video Instructions: The Implicit-VidSRL dataset and the iSRL-Qwen2-VL model are proposed to address the prediction of omitted implicit arguments (ingredients) in procedural video instructions. By decomposing multi-step instructions into {verb, what, where/with} triplets using a semantic role labeling (SRL) framework, the model, after being fine-tuned on silver-standard data, outperforms GPT-4o by 17% in implicit argument F1.
Implicit Reasoning in Transformers is Reasoning through Shortcuts: By training GPT-2 from scratch on controlled multi-step mathematical reasoning datasets, this paper systematically investigates the implicit reasoning mechanisms of language models. It reveals that implicit reasoning is fundamentally shortcut learning based on pattern matching—generalizing well on fixed-pattern data but overfitting on unfixed-pattern data, a finding that also holds true for SOTA large language models.
Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate: To address the semantic gap between queries (concrete instantiated facts) and rules (abstract variable formulations) in rule retrieval, this paper proposes SIAR (Self-Induction Augmented Retrieval) and \(R^3\) (Rule Relevance ReEstimate). By mapping queries into the rule semantic space and re-evaluating rule relevance, these two methods significantly improve both rule retrieval and downstream reasoning performance.
In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents: This paper proposes the Reflective Memory Management (RMM) mechanism. By combining prospective reflection (multi-granularity memory summarization) and retrospective reflection (reinforcement learning-driven online retrieval optimization), it builds an efficient memory management framework for long-term personalized dialogue systems, achieving an accuracy improvement of over 10% on LongMemEval.
Inducing Lexicons of In-Group Language with Socio-Temporal Context: This paper proposes the LISTN (Lexicon Induction with Socio-Temporal Nuance) framework, which utilizes dynamic word and user embeddings to jointly model the social structure and temporal evolution of community language. On the task of in-group lexicon induction within the anti-feminist online community (manosphere), LISTN achieves an average precision of 0.77, significantly outperforming existing methods.
Inferring Functionality of Attention Heads from their Parameters: Proposes the MAPS framework, which constructs a token mapping matrix \(M\) by projecting attention head parameters into the vocabulary space. MAPS infers the functions realized by attention heads without requiring any forward passes or training. The mapping accuracy is validated across 20 relational operations on 6 LLMs, and an automated pipeline is developed to discover numerous previously unidentified attention head functions.
Infogen: Generating Complex Statistical Infographics from Documents: The Infogen framework is proposed to transform textual documents into complex statistical infographics (combinations of multiple subplots) using a two-stage design: first generating structured intermediate metadata with a fine-tuned LLM, and then iteratively generating final infographic code using an LLM code generator and a feedback module.
Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking: Proposes Inner Thinking Transformer (ITT), which dynamically allocates more computational steps to key tokens without increasing parameters through adaptive token routing and residual thinking connections, achieving implicit deep reasoning. With only 162M parameters, it achieves 96.5% of the performance of a 466M Transformer.
Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction: Proposes Web Reconstruction (WebR), a framework that fully automatically synthesizes high-quality instruction-tuning data from raw web documents. Through a dual-perspective paradigm of "Web as Instruction" and "Web as Response", it generates IT data superior to existing SOTA without human annotation.
Inter-Passage Verification for Multi-evidence Multi-answer QA: This paper proposes the RI²VER framework to address the multi-answer QA problem. It first generates a high-recall but noisy candidate answer set by independently reading a large number of retrieved passages, and then filters out incorrect answers through inter-passage verification (generating verification questions \(\rightarrow\) collecting additional evidence \(\rightarrow\) cross-passage synthesis verification), improving average F1 by 11.17% on QAMPARI and RoMQA.
Interlocking-free Selective Rationalization Through Genetic-based Learning: This paper proposes GenSPP, the first selective rationalization framework that completely eliminates the interlocking problem. By using genetic algorithms to separately optimize the generator and predictor, it significantly improves rationale quality (Hl-F1 increased by 6.5%–10.3%) on synthetic datasets and hate speech detection tasks, while maintaining comparable classification performance.

Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process

IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery: Proposes IRIS, an open-source interactive research ideation system that achieves human-machine collaborative scientific hypothesis generation through Monte Carlo Tree Search (MCTS) for test-time compute scaling, fine-grained feedback mechanisms, and query-based literature synthesis.
Is Linguistically-Motivated Data Augmentation Worth It?: This study systematically compares the effectiveness of linguistically motivated and non-linguistic (random perturbation) data augmentation strategies across two low-resource languages, revealing that linguistic approaches yield advantages only when generated samples closely align with the training data distribution, and can otherwise be detrimental.
Knowledge Tracing in Programming Education Integrating Students' Questions: This paper proposes the SQKT (Students' Question-based Knowledge Tracing) model, which is the first to integrate students' questions and automatically extracted skill information into knowledge tracing. It predicts students' completion of subsequent programming problems in programming education, achieving an in-domain AUC improvement of up to 33.1%.
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding: KodCode proposes a three-stage synthetic data pipeline (synthesizing programming problems \(\rightarrow\) solution + unit test self-verification \(\rightarrow\) post-training data synthesis) to construct 447K verified programming question-solution-test triplets. The fine-tuned models outperform Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B on benchmarks such as HumanEval, MBPP, BigCodeBench, and LiveCodeBench.
LAQuer: Localized Attribution Queries in Content-grounded Generation: Proposes the Localized Attribution Queries (LAQuer) task—precisely localizing user-selected segments in generated text to corresponding segments in the source documents. This achieves a finer granularity of provenance than sentence-level attribution and is more user-directed than sub-sentence-level attribution, significantly reducing the attributed text length in multi-document summarization and long-form question answering.
LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models: Proposes LaTIM, a token-level decomposition method tailored for Mamba-1 and Mamba-2 that reconstructs the implicit computations of the SSM into a Transformer-like token-to-token contribution matrix, enabling fine-grained interpretability analysis for Mamba models.
Learning to Reason from Feedback at Test-Time: This paper proposes the FTTT (Feedback at Test-Time Training) paradigm, which formulates the environment feedback utilization of LLMs during the inference phase as an optimization problem. It designs a learnable test-time optimizer, OpTune, achieving superior scalability and performance compared to existing feedback utilization methods across four reasoning datasets.
LegalReasoner: Step-wised Verification-Correction for Legal Judgment Reasoning: This paper proposes the LegalReasoner framework to enhance the reliability of legal judgment prediction through dispute point identification, step-by-step reasoning, logical validation of each step using a process verifier, and an expert-designed attribution-based correction strategy. Combined with the newly released LegalHK dataset containing 58,130 Hong Kong court cases, the framework improves the concordance rate with court judgments on LLAMA-3.1-70B from 72.37% to 80.27%.
Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments: This paper presents the first large-scale cross-dataset generalization evaluation of four Transformer models across 17 English sentence-level argument mining datasets. The findings reveal that state-of-the-art models primarily learn dataset-specific lexical patterns rather than the structural signals of arguments, leading to generalization performance far below their in-dataset baselines. However, task-specific pre-training and multi-dataset joint training can partially alleviate this issue.
Map&Make: Schema Guided Text to Table Generation: This work proposes the Map&Make method, which first deconstructs unstructured text into propositional atomic statements (Map phase) and then derives the table schema and populates data based on them (Make phase), significantly improving text-to-table quality and interpretability on both Rotowire and Livesum scenarios.
Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus: This study builds and releases SPoRC, a large-scale dataset containing transcripts of 1.1 million podcast episodes (complete with metadata, inferred speaker roles, and acoustic features for 370k episodes). Through topic analysis, guest co-occurrence network analysis, and responsiveness analysis during the George Floyd protests, it provides the first comprehensive characterization of the content, structure, and responsiveness of the podcast ecosystem.
MapQaTor: An Extensible Framework for Efficient Annotation of Map-Based QA Datasets: This paper proposes MapQaTor, an extensible open-source Web framework that integrates multiple map APIs (Google Maps, OpenStreetMap, etc.) to accelerate geospatial QA dataset annotation by at least 30 times, while ensuring data reproducibility through API response caching.
Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks: This paper proposes the ENDow framework, which systematically analyzes the impact of ASR transcription noise on downstream NLU tasks for the first time. By evaluating task models under different noise levels and categories using a configurable pipeline, the authors find that named entities are the most critical word type and that models can tolerate a certain degree of noise.
Meta-Learning Neural Mechanisms rather than Bayesian Priors: Challenges the mainstream view that "meta-learning distills Bayesian simplicity priors in neural networks," demonstrating through formal language experiments that meta-learning actually implants useful neural mechanisms (e.g., counters) in models, rather than learning a preference for simplicity.
MEXMA: Token-level Objectives Improve Sentence Representations: Proposes MEXMA, a cross-lingual sentence encoder training method that combines sentence-level and token-level objectives. It uses the sentence representation of one language to predict the masked tokens of another language, while allowing gradients from both the sentence and token levels to directly update the encoder. MEXMA outperforms SONAR and LaBSE on bitext mining and multiple downstream tasks.
Minimal Pair-Based Evaluation of Code-Switching: This paper proposes a minimal pair-based evaluation method for code-switching (CS), collecting up to 1000 minimal pairs for each of 11 language pairs. It is found that both bilinguals and large-scale LLMs prefer naturally occurring CS sentences. Furthermore, larger models show more consistent preferences, and the manipulation of closed-class words produces the largest probability differences.
MIR: Methodology Inspiration Retrieval for Scientific Research Problems: This paper defines a new task, Methodology Inspiration Retrieval (MIR), which aims to retrieve papers that provide methodological inspiration for a given scientific research problem. It proposes the Methodology Adjacency Graph (MAG) to capture methodological inheritance relationships, achieving an improvement of +5.4 on Recall@3 and +7.8 on mAP, with an additional +4.5/+4.8 improvement when combined with LLM reranking.
Mitigating Shortcut Learning with InterpoLated Learning: This paper proposes InterpoLated Learning (InterpoLL), which mitigates the model's reliance on shortcut features and significantly improves generalization on minority samples by interpolating the representations of majority samples with those of minority samples from the same class.
MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines: This paper constructs MockConf—a Czech-centric student simultaneous interpreting dataset (7 hours, 5 European languages), featuring manually annotated span-level and word-level alignments. Additionally, the paper releases a dedicated annotation tool, InterAlign, and establishes automatic alignment baselines along with an evaluation metric framework.
Multi-Facet Blending for Faceted Query-by-Example Retrieval: Proposes FaBle (Multi-Facet Blending), a data augmentation method that constructs condition-oriented training triplets through a three-stage process: facet decomposition, facet generation, and facet recomposition. Using only 1K source documents, FaBle synthesizes training pairs that significantly improve faceted QBE retrieval under data-scarce conditions, notably outperforming a strong baseline trained on over 1.3M data points in the most challenging "method" facet.
Multi-Hop Question Generation via Dual-Perspective Keyword Guidance: This paper defines dual-perspective keywords—question keywords (capturing the questioner's intent) and document keywords (reflecting content relevant to the QA pair)—and proposes the DPKG framework. DPKG seamlessly integrates these keywords into the multi-hop question generation process by utilizing an extended Transformer encoder and two answer-aware decoders.
Narrative Media Framing in Political Discourse: Integrates narratological theory with media framing analysis to propose a structured narrative framing analysis framework composed of three components: characters (hero/villain/victim), conflict/resolution, and cultural stories. The effectiveness and transferability of this framework are validated across two domains: climate change and COVID-19.
Unifying Continuous and Discrete Text Diffusion with Non-simultaneous Diffusion Processes: This paper proposes NeoDiff, which unifies the theoretical framework of discrete and continuous text diffusion models. By introducing a dual-time framework consisting of "extrinsic time" (sentence-level diffusion progress) and "intrinsic time" (token-level diffusion progress), NeoDiff utilizes a Poisson process to independently allocate fine-grained noise levels to each token and adaptively adjusts denoising progress with a context-aware time predictor. NeoDiff outperforms existing diffusion baselines across multiple tasks, including machine translation, paraphrasing, and text simplification.
Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer: Proposes Neural Parameter Search (NPS), which improves the pruning efficiency of fine-tuned models by searching for optimal weight coefficients in the subspaces of the task vector. It achieves significant improvements across three scenarios: knowledge transfer (+1.5%), model merging (+2.1%), and compression (40% efficiency improvement).
Neuron Empirical Gradient: Discovering and Quantifying Neurons' Global Linear Controllability: This work reveals a global linear relationship between the activation values of Feed-Forward (FF) layer neurons and model outputs in pretrained language models. It introduces Neuron Empirical Gradient (NEG) to quantify this linear relationship and designs an efficient estimation method, NeurGrad. Finally, skill neuron probing experiments demonstrate that NEG can effectively characterize various language skills.
On Support Samples of Next Word Prediction: Based on the representer theorem, this paper investigates the role of training samples in next-word prediction of language models, identifying two types of support samples (facilitating prediction and suppressing prediction). It demonstrates that being a support sample is an intrinsic property of the sample itself (predictable prior to training), while non-support samples remain critical for representation learning.
Optimizing Decomposition for Optimal Claim Verification: Proposes the Dynamic Decomposition framework, which learns decomposition strategies from verifier feedback via reinforcement learning to decompose claims into the atomic granularity preferred by the verifier, thereby bridging the performance gap between decomposers and verifiers.
Partial Colexifications Improve Concept Embeddings: This work introduces partial colexification (affix/overlap colexification) into concept embedding training for the first time, consistently outperforming baselines that rely solely on full colexification across three tasks: semantic similarity modeling, semantic shift prediction, and word association prediction.
Towards Better Evaluation for Generated Patent Claims: This paper proposes the first evaluation benchmark for patent claims, Patent-CE (comprising 1,228 expert-annotated comparative evaluation data points), and a dedicated evaluation method, PatClaimEval (based on Longformer + a variant of contrastive learning). Across five dimensions—feature completeness, conceptual clarity, terminological consistency, logical connection, and overall quality—the proposed method consistently outperforms 13 existing baselines (including G-Eval-4) in correlation with human expert evaluation, achieving a 58% Spearman correlation improvement in the overall quality dimension.
Persistent Homology of Topic Networks for the Prediction of Reader Curiosity: This paper quantifies the topic network structure of text into topological voids (connected components, loops, cavities) using persistent homology, serving as a proxy for "information gaps" to predict reader curiosity, achieving 73% explained deviance on the novel The Hunger Games (vs. 30% for the baseline).
Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games: The PANDA method is proposed to project human personality traits (a total of 8 traits from the Big Five and Dark Triad) into the policy learning of text-based game agents. By guiding Q-value adjustment through a personality classifier, it is discovered that the High Openness personality significantly outperforms other personality types in adventure-based text games.
All That Glitters is Not Novel: Plagiarism in AI Generated Research: Expert review of research documents generated by autonomous scientific agents (such as AI Scientist) reveals that 24% of the documents constitute "intelligent plagiarism"—where methodologies map one-to-one to prior works without citing the original sources, and existing plagiarism detection tools fail to identify such rebranded copying.
PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment: The PopAlign framework is proposed, which constructs six diverse contrasting strategies across prompt, model, and pipeline levels (including the innovative Elicitive Contrast). It synthesizes high-quality preference data without additional human annotation, achieving a more comprehensive LLM alignment.
Principled Understanding of Generalization for Generative Transformer Models in Arithmetic Reasoning Tasks: Establishes the first unified theoretical framework to understand the generalization behavior of Transformers on arithmetic tasks (addition/multiplication/modular arithmetic). Starting from the interaction between task properties (translation invariance) and position encoding types (APE/RPE), it explains several previously puzzling generalization results (e.g., addition generalizes but multiplication does not, mod 100 generalizes but mod 101 does not). These theoretical predictions are validated by experiments.
ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering: This work proposes ProxAnn, a use-oriented evaluation protocol for topic models. By combining a scalable human evaluation pipeline with LLM proxy annotators, the study finds that the best LLM proxies are statistically indistinguishable from human annotators, serving as a reasonable alternative for automated evaluation.
PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings: Builds PVP (28,454 images, 596 behavioral messages, 9 persuasion strategies), the first large-scale dataset linking image persuasion strategies with the psychological traits (personality/values/moral foundations) of 2,521 annotators. It validates the critical role of psychological profiles in enhancing persuasion effects on two benchmark tasks: personalized persuasive image generation and automatic persuasiveness evaluation.
Quantifying Lexical Semantic Shift via Unbalanced Optimal Transport: Applies Unbalanced Optimal Transport (UOT) to sets of contextualized word embeddings, proposing the Sense Usage Shift (SUS) metric to quantify semantic changes at the level of individual usage instances, which unifies three tasks: instance-level change detection, word-level change magnitude quantification, and semantic expansion/reduction determination.
Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability: Through systematic experiments across 18 datasets and 7 task categories, this paper finds that incorporating rationales (reasoning processes) into training data is not always beneficial—rationales sometimes impair model performance, but they can improve model reliability (calibration). Furthermore, the improvements in performance and reliability are linearly correlated, and both are driven by the inherent difficulty of the task.
RePanda: Pandas-powered Tabular Verification and Reasoning: RePanda is proposed to translate natural language claims into executable pandas queries for tabular fact verification, achieving an accuracy of 84.09% on TabFact and 84.72% on OOD WikiFact without additional fine-tuning. Meanwhile, with only a 7B parameter model, it approaches the zero-shot performance of the 671B DeepSeek-Chat, and scales to tabular question-answering tasks, achieving 75.1% accuracy.
Research Borderlands: Analysing Writing Across Research Cultures: Through interviews with interdisciplinary researchers, this work constructs a cultural norm framework for academic writing (comprising four categories: structure, style, rhetoric, and citation). It quantifies writing differences across 11 CS communities using computational metrics, revealing a severe "homogenization" tendency in LLMs during cross-community writing adaptation.
Revisiting Weak-to-Strong Generalization: Reverse KL vs. Forward KL: In the Weak-to-Strong Generalization (W2SG) framework, this paper proposes replacing forward KL with reverse KL as the loss function. It is theoretically proven that the mode-seeking property of reverse KL ensures the strong model outperforms the weak supervisor by at least the magnitude of the "disagreement". Experiments on the GPT-2, Pythia, and Qwen2.5 series validate that reverse KL/CE outperforms forward KL in 12 out of 12 settings, demonstrating superior noise robustness.
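The mode-seeking behavior of reverse KL can be seen on a toy example. The sketch below is not the paper's experimental setup: the three-class distributions `weak`, `broad`, and `peaked` are made-up values chosen only to contrast the two divergences, with `weak` standing in for a noisy supervisor that splits its mass between two classes.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical weak-supervisor distribution with two competing modes.
weak = [0.49, 0.02, 0.49]
# Two candidate strong-model outputs: one spreads mass, one commits to a mode.
broad = [1 / 3, 1 / 3, 1 / 3]
peaked = [0.96, 0.02, 0.02]

# Forward KL: the supervisor is the reference, KL(weak || strong).
forward_broad = kl(weak, broad)
forward_peaked = kl(weak, peaked)

# Reverse KL: the strong model is the reference, KL(strong || weak).
reverse_broad = kl(broad, weak)
reverse_peaked = kl(peaked, weak)

print(f"forward KL: broad={forward_broad:.3f}, peaked={forward_peaked:.3f}")
print(f"reverse KL: broad={reverse_broad:.3f}, peaked={reverse_peaked:.3f}")
```

On these values forward KL penalizes the peaked candidate more heavily (it must cover both supervisor modes), while reverse KL prefers it, which is the mode-seeking behavior the summary refers to.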
RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation: Inspired by the residual learning in ResNet, this paper proposes the RMoA framework. It optimizes multi-agent collaboration architectures through embedding-based greedy diversity selection, residual extraction/aggregation agents, and an adaptive termination mechanism, achieving state-of-the-art (SOTA) performance while reducing computational overhead.
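One plausible reading of "embedding-based greedy diversity selection" is a max-min style greedy loop over response embeddings. The sketch below is an assumption-laden illustration, not the paper's algorithm: the seed choice (index 0), the cosine similarity measure, and the function name `greedy_diverse_subset` are all hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def greedy_diverse_subset(embeddings, k):
    # Assumption: start from the first response, then repeatedly add the
    # candidate whose highest similarity to the already-selected set is lowest.
    selected = [0]
    while len(selected) < min(k, len(embeddings)):
        remaining = [i for i in range(len(embeddings)) if i not in selected]
        best = min(
            remaining,
            key=lambda i: max(cosine(embeddings[i], embeddings[j]) for j in selected),
        )
        selected.append(best)
    return selected

# Toy 2-D "response embeddings": index 1 is nearly a duplicate of index 0.
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
chosen = greedy_diverse_subset(embeddings, k=3)
```

Here the near-duplicate response at index 1 is the one left out, so downstream aggregation sees three distinct viewpoints instead of two similar ones.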
RoToR: Towards More Reliable Responses for Order-Invariant Inputs: Proposes RoToR, a zero-shot, order-invariant language model based on global ordering and circular position encoding allocation. It achieves stable order invariance by minimizing position ID modifications and designs a Selective Routing mechanism to adaptively handle mixed input types.
S2WTM: Spherical Sliced-Wasserstein Autoencoder for Topic Modeling: Proposes S2WTM, a topic model based on a spherical sliced-Wasserstein autoencoder. It aligns the aggregated posterior and prior distributions on a hyperspherical latent space, effectively avoiding the posterior collapse problem of VAEs while outperforming existing SOTAs in topic coherence and diversity.
S3 - Semantic Signal Separation: S3 conceptualizes topic modeling as discovering independent semantic axes within a semantic space. By utilizing Independent Component Analysis (ICA) to decompose document embedding matrices, it produces highly coherent and diverse topics without requiring preprocessing, while standing out as the fastest contextual topic model (averaging 4.5 times faster than BERTopic).
Segment-Based Attention Masking for GPTs: MAS (Masked Attention by Segment) replaces the causal attention mask with segment-based bidirectional attention in the prefill phase of pretrained GPT models: tokens within the same segment can attend to each other, while the generation phase keeps the causal mask. With LoRA fine-tuning, it consistently improves performance on 8 commonsense reasoning tasks (average gains of +1.8% on Llama-3-8B and +3.3% on Llama-3.2-3B) with zero additional computational overhead.
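One way to picture the prefill mask described in this entry is the minimal sketch below. It assumes that each prompt token carries a segment ID and that a token may attend to every earlier token plus every token in its own segment; the function name and this exact rule are illustrative assumptions, not the authors' implementation.

```python
def segment_prefill_mask(segment_ids):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed rule: standard causal attention (j <= i) plus bidirectional
    attention between tokens that share a segment ID.
    """
    n = len(segment_ids)
    return [
        [j <= i or segment_ids[i] == segment_ids[j] for j in range(n)]
        for i in range(n)
    ]

# Hypothetical prompt with two segments of two tokens each.
mask = segment_prefill_mask([0, 0, 1, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Tokens generated after the prompt would fall back to the ordinary causal rule, which matches the entry's statement that the generation phase keeps the causal mask.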
SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection: To address two major pain points in the evaluation of Open Domain Event Detection (ODED)—namely, the lack of real-world representativeness in limited benchmarks and the inability of token-level matching metrics to capture semantic similarity—this work proposes the SEOE framework. It constructs a scalable benchmark containing 564 event types across 7 major domains and introduces an LLM-based semantic \(F_1\) evaluation metric.
Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions: Proposes an end-to-end framework to generate synthetic users with health conditions (covering sleep and diabetes management) based on real demographic, health/lifestyle, and behavioral/psychological profile data. The framework is used to evaluate the interaction quality of health coaching agents, and human expert evaluation shows that its synthetic users significantly outperform generic synthetic users.
SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning: This paper proposes SoRFT (Subtask-oriented Reinforced Fine-Tuning), which decomposes the GitHub Issue resolving task into four subtasks: file localization, function localization, line localization, and code editing. Through a two-stage training process consisting of rejection-sampled SFT and rule-based PPO reinforcement learning, SoRFT significantly enhances the issue-resolving capabilities of open-source LLMs on SWE-Bench.
SOTOPIA-Ω: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents: This paper proposes the SOTOPIA-Ω framework, which dynamically injects multi-step reasoning strategies and direct strategies from negotiation theory into expert agents to automatically construct high-quality social dialogue training corpora. It defines a new concept "Social Instruction Following (S-IF)" along with two evaluation metrics, enabling a 7B model to outperform GPT-4 expert agents in social goal achievement.
SPOT: Bridging Natural Language and Geospatial Search for Investigative Journalists: Proposes the SPOT system, which fine-tunes LLaMA 3 to translate natural language scene descriptions into YAML queries, combining this with a semantic tag bundling mechanism to enable reliable natural language access to OpenStreetMap data, serving geolocation verification in investigative journalism.
Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation: An atomic-level (sentence-level) evaluation framework is proposed to detect fine-grained Out-of-Character (OOC) behaviors of Large Language Models (LLMs) in open-ended generation through three metrics (ACC_atom, IC_atom, RC_atom). This addresses the issue where traditional holistic scoring approaches fail to capture subtle personality inconsistencies in long texts.
Statistical Deficiency for Task Inclusion Estimation: Based on the theory of statistical deficiency, this paper proposes a theory-driven framework for defining and measuring task inclusion relations. Using information sufficiency (IS) as a computable proxy metric, the framework estimates the degree of inclusion between tasks by comparing the intermediate layer representations of fine-tuned models. It successfully reconstructs the hierarchical relationships of a classic NLP pipeline on both synthetic data and real-world NLP tasks.
STRICTA: Structured Reasoning in Critical Text Assessment for Peer Review and Beyond: This paper proposes the STRICTA framework, which models text assessment as an explicit, step-by-step reasoning graph (workflow) based on Structured Causal Models (SCMs). By collecting a dataset of over 4,000 reasoning steps from more than 40 experts reviewing biomedical papers, the study finds that differences in prior knowledge are the primary cause of expert disagreement, and writing style has a causal impact on the final evaluation. Furthermore, while LLMs suffer from error propagation, this can be effectively mitigated with human supervision.
Subword Models Struggle with Word Learning, but Surprisal Hides It: Through the psycholinguistic lexical decision task, this paper reveals that subword (BPE) language models are far inferior to character-level models in isolated word recognition, and that the commonly used surprisal metric masks this deficiency by introducing syntactic context.
TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation: TabXEval proposes a rubric-based two-stage table evaluation framework: structural alignment via TabAlign followed by fine-grained semantic and syntactic comparison via TabCompare, accompanied by the multi-domain benchmark TabXBench.
TACLR: A Scalable and Efficient Retrieval-Based Method for Industrial Product Attribute Value Identification: TACLR proposes the first product attribute value identification (PAVI) method based on the retrieval paradigm. By incorporating taxonomy-aware contrastive learning and an adaptive inference mechanism, it outperforms classification and generation methods across the board in handling implicit values, out-of-distribution (OOD) values, and normalized outputs, and has been successfully deployed on the Xianyu platform.
Tag-Evol: Achieving Efficient Instruction Evolving via Tag Injection: Tag-Evol proposes an instruction evolution framework based on knowledge tag injection. By constructing a multi-step fine-grained tag pool and a budget-controlled injection mechanism, it generates high-quality evolved instruction data of varying difficulties without iteration, significantly outperforming Evol-Instruct across multiple tasks and backbone models.
TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data: TARGA proposes a targeted synthetic data generation framework that dynamically generates highly relevant synthetic examples for in-context learning in KBQA without requiring any human annotation. Using only a 7B model, it significantly outperforms all non-fine-tuned methods on GrailQA (+7.7 F1) and KBQA-Agent (+12.2 F1).
Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text: TIACBM proposes a task-informed anti-curriculum masking fine-tuning strategy: leveraging downstream task knowledge (e.g., sentiment polarity, part-of-speech tags) to determine which tokens are masked, and employing a cyclically decaying masking rate. It achieves statistically significant performance improvements across three tasks: sentiment analysis, text classification, and authorship attribution.
The Harmonic Structure of Information Contours: Proposes the Harmonic Surprisal (HS) hypothesis: surprisal curves in text fluctuate periodically and align with discourse structures (EDUs, sentences, paragraphs). Harmonic regression with time scaling finds consistent periodic patterns across 6 languages, refining the classical Uniform Information Density hypothesis.
The Hidden Attention of Mamba Models: Reveals that Mamba (the selective state space model S6) can be reformulated as an implicit causal self-attention mechanism. Based on this, the paper proposes attention visualization and interpretability methods (Attention Rollout and Mamba-Attribution) applicable to Mamba models, showing that its interpretability metrics are comparable to those of Transformers.
The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research: An end-to-end computational pipeline is developed to quantify scholarly citation fidelity at scale. Analyzing 13 million citation sentence pairs reveals critical factors influencing citation fidelity, and a quasi-causal experiment confirms a "telephone effect" where low-fidelity intermediate citations lead to further distortion in subsequent citations.
The Time Scale of Redundancy between Prosody and Linguistic Context: This study systematically investigates the time scale of redundancy between prosodic features (such as pitch, loudness, and duration) and linguistic context. It reveals that the redundancy between prosody and past context spans a relatively long time scale (3-8 words), whereas the redundancy with future context is limited to a short time scale (1-2 words). This highlights the dual role of prosody in spoken communication: aiding the integration of past information and predicting upcoming words.
Theoretical Guarantees for Minimum Bayes Risk Decoding: This paper provides the first rigorous theoretical convergence guarantees for Minimum Bayes Risk (MBR) decoding, proving that MBR decoding approximates the optimal solution at a rate of \(O(n^{-1/2})\) when the size of the reference hypothesis set is \(n\). It also theoretically compares MBR with MAP decoding, showing that MBR converges faster in various scenarios.
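For readers unfamiliar with the decoding rule analyzed in this entry, a minimal sketch of sample-based MBR decoding follows: the chosen hypothesis is the candidate with the highest average utility against a set of reference hypotheses. The function names and the toy word-overlap utility are assumptions for illustration only; practical systems typically use a learned or task-specific metric.

```python
def mbr_decode(candidates, references, utility):
    """Pick the candidate with the highest mean utility against the references."""
    def expected_utility(hyp):
        return sum(utility(hyp, ref) for ref in references) / len(references)
    return max(candidates, key=expected_utility)

def word_overlap(hyp, ref):
    # Toy utility (assumption): Jaccard overlap between the two word sets.
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / len(h | r) if (h | r) else 1.0

# Hypothetical model samples, used both as candidates and as references.
samples = ["the cat sat", "the cat sat down", "the cat slept", "a dog ran"]
print(mbr_decode(samples, samples, word_overlap))
```

Here the number of references plays the role of \(n\) in the entry's \(O(n^{-1/2})\) rate: more references give a better estimate of each candidate's expected utility.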
Learning to Reason Over Time: Timeline Self-Reflection for Temporal Reasoning: The TISER framework is proposed to achieve test-time scaling for LLM temporal reasoning through a four-stage pipeline of "reasoning → timeline construction → self-reflection → answer generation." When combined with fine-tuning on synthetic reasoning trajectory data, this framework enables 7B open-source models to outperform GPT-4 on multiple temporal reasoning benchmarks and achieve SOTA results on tasks such as TGQA.
Tokenisation is NP-Complete: Proves that two variants of the tokenisation problem, direct tokenisation and bottom-up tokenisation, are both NP-complete. This is achieved via a polynomial-time reduction from the max-2-SAT problem, implying that no efficient optimal tokenisation algorithm exists unless P = NP and thus justifying the use of approximation methods like BPE.
Towards Comprehensive Argument Analysis in Education: Dataset, Tasks, and Method: This paper proposes an annotation scheme for Chinese high school argumentative essays, featuring 14 fine-grained argumentation relation types across two dimensions: vertical (argumentation relations) and horizontal (discourse relations). It establishes a comprehensive benchmark covering three tasks: argumentative component detection, relation prediction, and automated essay scoring.
Towards Text-Image Interleaved Retrieval: This paper defines the new task of Text-Image Interleaved Retrieval (TIIR), constructs the first TIIR benchmark dataset based on wikiHow (155K documents, 7,654 test pairs), and proposes the Matryoshka Multimodal Embedder (MME). MME addresses the efficiency issue and semantic bias caused by excessive visual tokens in MLLMs via multi-granularity visual token compression, significantly improving retrieval performance.
Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis: The Tree-of-Debate (ToD) framework is proposed, which represents scientific papers as LLM personas that engage in tree-structured debates. Through self-deliberation, iterative retrieval, and moderator-guided hierarchical subtopic expansion, ToD generates fine-grained, contextualized comparative summaries of papers, significantly outperforming baseline methods in evaluations by domain experts.
TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification: Proposes the TROVE text provenance challenge, which traces each sentence in the target text back to specific source sentences in the source documents and classifies their fine-grained relationships (quotation, compression, inference, etc.), covering multi-document and long-document scenarios.
Tuna: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos: Tuna constructs a fine-grained, multi-dimensional annotated dataset of 1,000 temporally dense short videos, along with two evaluation tasks: captioning (event splitting → matching → relationship classification) and temporal question answering. This systematically exposes the weaknesses of current video LMMs in dynamic temporal understanding.
Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs: This paper empirically reveals a significant correlation between the misalignment of common ground and task success by annotating "conversational friction" in Ubuntu IRC technical support dialogues, and finds that LLMs can identify explicit conversational friction but struggle with implicit friction requiring pragmatic or domain reasoning.
Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling: This work introduces domain adaptation formally into low-resource topic modeling for the first time. It derives a finite-sample generalization upper bound to guide method design and proposes the DALTA framework, which selectively transfers cross-domain topic knowledge through a shared encoder, domain-specific decoders, and adversarial alignment.
Unique Hard Attention: A Tale of Two Sides: This paper proves that in finite-precision transformers, leftmost unique hard attention (UHA) is strictly weaker than rightmost UHA. The former is equivalent to the linear temporal logic fragment LTL[\(\Diamond^-\)] (i.e., partially ordered finite automata) and has the same expressive power as soft-attention transformers, thereby precisely characterizing the impact of attention directionality on transformer expressivity.
Unlocking Speech Instruction Data Potential with Query Rewriting: This work proposes a query rewriting framework based on multi-LLM knowledge fusion and a multi-agent annotation verification method. It rewrites text instructions that exceed the TTS vocabulary into formats suitable for speech synthesis, increasing the usability rate of speech instruction data from 72% to 93% to construct high-quality speech instruction datasets for end-to-end Large Speech Language Models (LSLMs).
USDC: A Dataset of User Stance and Dogmatism in Long Conversations: This paper constructs USDC, the first user-level long-conversation dataset for stance and dogmatism, containing 764 multi-user Reddit conversations (across 22 subreddits). Using a majority voting scheme over six configurations ({Mistral Large, GPT-4} \(\times\) {zero/one/few-shot}), stances are annotated on a 5-level scale and dogmatism on a 4-level scale. Baseline performances are established using fine-tuning/instruction-tuning on 7 SLMs.
Using Shapley Interactions to Understand How Models Use Structure: This paper uses the Shapley Taylor Interaction Index (STII) to systematically analyze, across modalities (text and speech), how models encode syntactic structure, non-compositional semantics, and phonetic coarticulation through non-linear interactions. It finds that autoregressive models significantly outperform masked models in syntactic encoding.
Using Source-Side Confidence Estimation for Reliable Translation into Unfamiliar Languages: This paper proposes a gradient-based source-side confidence estimation method that directly detects potential mistranslations by measuring the sensitivity of the output sequence probability to the source embeddings. This approach outperforms traditional methods without requiring word alignment, and supports the construction of an interactive translation Web application for users fluent in the source language.
Value Residual Learning: The authors propose ResFormer and SVFormer, which introduce a residual connection from the first-layer Value vector to subsequent layers in the attention mechanism. This enhances the propagation of initial token-level information in deep networks. Consequently, these models achieve performance comparable to standard Transformers with 16.11% fewer parameters and 20.3% less training data, while SVFormer also reduces the KV cache by nearly half.
VAQUUM: Are Vague Quantifiers Grounded in Visual Data?: This paper introduces the VAQUUM dataset (20,300 human ratings, 1,089 images) to systematically evaluate the alignment between vision-language models (VLMs) and humans regarding the use of vague quantifiers (e.g., few, many). The findings show that while VLMs are influenced by object counts similarly to humans, model performance varies significantly across different evaluation paradigms, indicating that judging and generating vague quantifiers depend on distinct cognitive processes.
Verbosity-Aware Rationale Reduction: Effective Reduction of Redundant Rationale: The VARR framework is proposed to identify and remove redundant sentences in reasoning paths on a sentence-by-sentence basis using a likelihood-based "verbosity" criterion, achieving an average accuracy improvement of 7.71% while reducing token generation by 19.87% across various reasoning tasks.
Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction: This paper proposes MM-VAP, a multimodal predictive turn-taking model. By incorporating visual cues such as facial expressions, head poses, and gaze direction into a voice predictive model, it improves the hold/shift prediction accuracy from 79% to 84% on a video conferencing corpus.
Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding: This work proposes the Weak-to-Strong Decoding (WSD) framework, in which a small aligned model drafts an aligned response prefix and a large base model continues from it. This achieves low-resource preference alignment without introducing an alignment tax.
What is Stigma Attributed to? A Theory-Grounded, Expert-Annotated Interview Corpus for Demystifying Mental-Health Stigma: Constructs an expert-annotated mental health stigma interview corpus (4,141 utterances, 684 participants) based on attribution theory, covering 7 fine-grained stigma types and socio-cultural background information. It benchmarks multiple SOTA neural models on the stigma detection task and analyzes the challenges they face.
What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation: This paper systematically studies the automatic evaluation of book-length stories (>100K tokens). It constructs LongStoryEval, the first large-scale long story evaluation benchmark (600 newly published novels, 340K reader reviews), proposes a hierarchical evaluation criteria framework, and compares the effectiveness of three evaluation strategies. It also trains a specialized evaluation model, NovelCritique-8B, which outperforms GPT-4o in alignment with human ratings.
Words of Warmth: Trust and Sociability Norms for over 26k English Words: The first large-scale warmth (\(W\)), trust (\(T\)), and sociability (\(S\)) association lexicon (covering 26k+ English words) was constructed through a rigorous crowdsourced annotation process. Its extensive value in social cognition research is demonstrated through analyses of child vocabulary acquisition and social media stereotype case studies.
You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with Multi-Agent Conversations: This paper proposes the MIMIC framework to generate synthetic meeting transcripts through multi-agent debate simulations, constructing the FAME dataset consisting of 800 meetings (500 English + 300 German), and designing a psychological-behavior-based evaluation framework for conversational realism.
Your Model is Overconfident, and Other Lies We Tell Ourselves: Through a comprehensive analysis of 29 models on the ChaosNLI and DynaSent datasets, this work reveals a correlated but non-linear and non-monotonic relationship among data complexity metrics such as annotator disagreement, training dynamics, and model confidence, challenging the common assumption that "model uncertainty \(\approx\) human disagreement."
Zero-Shot Conversational Stance Detection: Dataset and Approaches: This work constructs the first zero-shot multi-turn multi-party conversational stance detection dataset, ZS-CSD (280 targets, 17,063 conversational samples), and proposes the SITPCL model. By combining a speaker interaction encoder with target-aware prototypical contrastive learning, SITPCL achieves state-of-the-art performance (F1-macro of 43.81%) in zero-shot conversational stance detection.