
💬 ACL2026 · 5 paper notes

Agree, Disagree, Explain: Decomposing Human Label Variation in NLI through the Lens of Explanations

This paper extends the LiTEx reasoning taxonomy from "label-consistent, explanation-variant" settings to label-disagreement scenarios, finding that annotators may share similar reasoning strategies even when they assign different labels, and that agreement on reasoning category tracks the semantic similarity of explanations better than label agreement alone.
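
To make that contrast concrete, here is a minimal sketch (not the paper's code) that checks whether annotator pairs sharing a LiTEx reasoning category have more similar explanation embeddings than pairs sharing only an NLI label. The records, category names, and 8-dimensional embeddings are all hypothetical placeholders.

```python
# Minimal sketch: does reasoning-category agreement track explanation
# similarity better than label agreement? All records are hypothetical.
from itertools import combinations
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Each record: (NLI label, LiTEx reasoning category, explanation embedding).
rng = np.random.default_rng(0)
records = [
    ("entailment",    "world-knowledge", rng.normal(size=8)),
    ("neutral",       "world-knowledge", rng.normal(size=8)),
    ("entailment",    "coreference",     rng.normal(size=8)),
    ("contradiction", "coreference",     rng.normal(size=8)),
]

sims = {"same_label": [], "diff_label": [], "same_category": [], "diff_category": []}
for (l1, c1, e1), (l2, c2, e2) in combinations(records, 2):
    s = cosine(e1, e2)
    sims["same_label" if l1 == l2 else "diff_label"].append(s)
    sims["same_category" if c1 == c2 else "diff_category"].append(s)

# With real data, the paper's finding would show up as a larger mean
# similarity for same_category than for same_label pairs.
for key, vals in sims.items():
    print(key, round(float(np.mean(vals)), 3) if vals else "n/a")
```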

Are Large Language Models Economically Viable for Industry Deployment?

This paper proposes Edge-Eval, a framework that evaluates LLMs across their full deployment lifecycle on legacy T4 GPUs using five deployment metrics: economic break-even, intelligence-per-watt, system density, cold-start tax, and quantization fidelity. The framework reveals that sub-2B models outperform 7B models across both economic and ecological dimensions, and surfaces the counterintuitive finding that QLoRA, despite cutting memory usage by ~60%, can increase energy consumption by up to 7×.
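
As a rough illustration of two of the five metrics, the sketch below computes intelligence-per-watt and an economic break-even point. The formulas are plain-reading approximations rather than Edge-Eval's exact definitions, and every number is hypothetical.

```python
# Minimal sketch (hypothetical numbers, not Edge-Eval's code): two of the
# five deployment metrics for a small vs. a 7B model on a legacy GPU.

def intelligence_per_watt(benchmark_score: float, avg_power_watts: float) -> float:
    # Accuracy points delivered per watt of sustained power draw.
    return benchmark_score / avg_power_watts

def break_even_months(hw_cost: float, monthly_self_host: float,
                      monthly_api_cost: float) -> float:
    # Months until self-hosting (fixed hardware cost plus running cost)
    # undercuts a pay-per-token API; assumes constant monthly usage.
    monthly_saving = monthly_api_cost - monthly_self_host
    return float("inf") if monthly_saving <= 0 else hw_cost / monthly_saving

small = {"score": 61.0, "watts": 45.0}   # hypothetical sub-2B model
big   = {"score": 68.0, "watts": 70.0}   # hypothetical 7B model

for name, m in (("sub-2B", small), ("7B", big)):
    print(name, "intelligence/watt:",
          round(intelligence_per_watt(m["score"], m["watts"]), 3))

print("break-even (months):",
      round(break_even_months(hw_cost=2500, monthly_self_host=120,
                              monthly_api_cost=400), 1))
```

Under a ratio like this, a smaller model with a modestly lower score can still dominate once power draw enters the denominator, which is the shape of the paper's economic/ecological result.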

Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

This paper proposes PTE (Prefill Token Equivalents), a hardware-aware efficiency metric for tool-integrated reasoning (TIR) that unifies the costs of internal reasoning and external tool use. Through large-scale experiments, the paper identifies four inefficiency patterns in TIR: confirmatory tool use, tool mixing, lack of tool priors, and tool format collapse.
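
The sketch below shows one plausible reading of such a metric (not the paper's exact formula): both decoding time and tool wall-clock time are converted into the number of prefill tokens the same hardware could have processed in that time, so internal and external costs land on one axis. The throughput constants are hypothetical.

```python
# Minimal sketch of PTE-style accounting (an illustrative reading, not the
# paper's formula): express internal decoding and external tool latency in
# units of "prefill tokens the GPU could have processed instead".

PREFILL_TOKENS_PER_SEC = 8000.0   # hypothetical prefill throughput
DECODE_TOKENS_PER_SEC = 80.0      # hypothetical decode throughput

def pte(decode_tokens: int, tool_seconds: float) -> float:
    # Internal reasoning cost: time spent decoding tokens.
    decode_secs = decode_tokens / DECODE_TOKENS_PER_SEC
    # External cost: wall-clock time waiting on tool calls. Both are
    # converted with the same prefill rate, giving one hardware-aware unit.
    return (decode_secs + tool_seconds) * PREFILL_TOKENS_PER_SEC

# A trace with 600 reasoning tokens and two tool calls of 1.2 s each:
print(round(pte(decode_tokens=600, tool_seconds=2.4)), "prefill-token equivalents")
```

Under this accounting, a confirmatory tool call that merely re-verifies an already-derived answer adds PTE without adding accuracy, which is exactly the kind of waste the four inefficiency patterns describe.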

Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation

This paper deploys LLM agents (DeepSeek and GPT series models) in the classic beer distribution game to simulate multi-stage supply chains, systematically investigating how cognitive heterogeneity (differences in reasoning capability across agents) affects system behavior. The findings show that LLM agents reproduce the bullwhip effect and the myopic ordering behaviors observed in human players, and that information sharing effectively mitigates both.
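
For intuition about the quantity being measured, the sketch below runs a stripped-down beer-game chain with naive order-up-to agents in place of LLM agents and reports the bullwhip ratio, i.e., the variance of orders leaving the top stage divided by the variance of customer demand. The stage count, shipping delay, and ordering rule are all simplifying assumptions.

```python
# Minimal sketch (not the paper's simulator): a 4-stage beer-game-style
# chain with myopic order-up-to agents; all parameters are hypothetical.
import random, statistics

random.seed(0)
STAGES, WEEKS, TARGET = 4, 60, 12
inventory = [TARGET] * STAGES
pipeline = [[4, 4] for _ in range(STAGES)]   # 2-week shipping delay per stage
demand_hist, factory_orders = [], []

for _ in range(WEEKS):
    demand = random.choice([2, 4, 4, 6])     # noisy end-customer demand
    demand_hist.append(demand)
    order = demand                           # demand seen by the retailer
    for s in range(STAGES):
        inventory[s] += pipeline[s].pop(0)   # receive the delayed shipment
        inventory[s] -= order                # ship; backlog = negative stock
        # Myopic order-up-to rule: chase the gap back to the target level.
        # Rules like this, combined with delays, tend to amplify variance
        # as orders propagate upstream.
        order = max(0, TARGET - inventory[s])
        pipeline[s].append(order)
    factory_orders.append(order)             # order leaving the top stage

bullwhip = statistics.variance(factory_orders) / statistics.variance(demand_hist)
print("bullwhip ratio:", round(bullwhip, 2))
```

A ratio above 1 is the bullwhip effect; the paper's question is how this ratio shifts when the ordering rule comes from LLM agents of varying reasoning capability, and how information sharing dampens it.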

Reliable Evaluation Protocol for Low-Precision Retrieval

This paper identifies that low-precision retrieval systems (e.g., binarized or quantized embeddings) suffer from large numbers of spurious ties due to reduced score granularity, making evaluation results highly unstable. It proposes two complementary strategies, High-Precision Scoring (HPS) and Tie-aware Retrieval Metrics (TRM), to enable more reliable and consistent evaluation of low-precision retrieval systems.
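
A minimal sketch of both ideas on one toy query (a plain-reading interpretation, not the paper's code): binarizing scores collapses them onto a few distinct values and creates ties; a tie-aware metric scores the relevant document at its mid-rank within the tied group, while HPS breaks ties with the full-precision scores. The vectors, the sign-based binarization, and the single relevant document are all hypothetical.

```python
# Minimal sketch: spurious ties from binarized scoring, a tie-aware
# reciprocal rank, and HPS-style tie-breaking. Toy data throughout.
import numpy as np

rng = np.random.default_rng(1)
query = rng.normal(size=16)
docs = rng.normal(size=(6, 16))
relevant = 3                                   # index of the one relevant doc

full = docs @ query                            # full-precision scores
coarse = np.sign(docs) @ np.sign(query)        # binarized: few distinct values

def mid_rank(scores, idx):
    # Tie-aware rank: docs strictly above, plus half of the tied group
    # (including the doc itself), instead of an arbitrary sort order.
    better = int(np.sum(scores > scores[idx]))
    tied = int(np.sum(scores == scores[idx]))
    return better + (tied + 1) / 2

print("tie-aware RR under coarse scores:", round(1 / mid_rank(coarse, relevant), 3))

# HPS: keep the coarse ranking, but break ties by full-precision scores.
order = sorted(range(len(docs)), key=lambda i: (-coarse[i], -full[i]))
print("HPS rank of relevant doc:", order.index(relevant) + 1)
```

The instability the paper measures comes from the arbitrary within-tie ordering that a naive sort would pick; both strategies remove that arbitrariness, one by rescoring and one by redefining the metric.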