
💡 LLM Reasoning

🧪 ICML2025 · 19 paper notes

📌 Same area in other venues: 📷 CVPR2026 (16) · 🔬 ICLR2026 (241) · 💬 ACL2026 (82) · 🧪 ICML2026 (78) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (82)

🔥 Top topics: Reasoning ×11 · LLM ×6

Ad-Hoc Human-AI Coordination Challenge (AH2AC2): This work proposes the AH2AC2 challenge—based on the cooperative card game Hanabi—which constructs human proxy agents via behavioral cloning and regularized reinforcement learning, and open-sources a limited human dataset to provide a standardized, reproducible evaluation framework for human-AI ad-hoc coordination research.
AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism: AdaDecode trains lightweight LM heads at intermediate layers to predict tokens early with high confidence, and defers the KV cache computation of the remaining layers so that it can be processed in parallel. Its output is identical to that of standard autoregressive decoding, and it achieves up to 1.73× higher decoding throughput.
Adversarial Manipulation of Reasoning Models using Internal Representations: This paper finds that reasoning models (such as DeepSeek-R1-Distill-Llama-8B) exhibit a linear "caution direction" in the activation space during the CoT generation phase. Ablating this direction effectively jailbreaks the model, revealing that CoT itself is a new target for adversarial attacks.
DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination: Based on the concept of metamorphic testing, this work decomposes programming problems into complexity-related algorithmic abstractions and complexity-independent contextual descriptions. Through the collaboration of four LLM agents, it automatically generates semantically equivalent yet textually distinct variants of programming problems. This mitigates data contamination and allows the true reasoning capabilities of Code LLMs to be evaluated; the framework is validated on 18 models.
Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models: Through causal, representation, and attention analyses, this paper identifies a three-stage emergent symbolic architecture supporting abstract reasoning across 13 open-source LLMs: symbolic abstraction heads transform input tokens into abstract variables, symbolic induction heads perform sequence induction at the abstract variable level, and retrieval heads retrieve the corresponding values based on predicted abstract variables for next-token prediction.
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators: This paper proposes the JETTS benchmark to systematically evaluate the performance of LLM-judges as evaluators in test-time scaling scenarios (response reranking, step-level beam search, and critique-based refinement). The findings show that while judges are competitive with outcome reward models in reranking, they are significantly weaker than process reward models in beam search, and natural language critiques currently fail to effectively guide generator improvements.
FMC: Formalization of Natural Language Mathematical Competition Problems: This paper proposes a fully automated formalization pipeline based on LLM error feedback that translates natural language mathematical competition problems into Lean formal representations, constructing FMC, an Olympiad-level dataset containing 3,922 natural language problems aligned with 9,787 Lean formalizations, and validating its value as an automated theorem proving benchmark.
Improving Rationality in the Reasoning Process of Language Models through Self-playing Game: This paper proposes the Critic-Discernment Game (CDG), a self-playing language game where an LLM interacts with a "Helpful Critic" and a "Misleading Critic." Using Reinforced Self-Training (ReST), the three roles are jointly optimized. Without relying on human or stronger model supervision, this approach significantly enhances the LLM's rational understanding of its own reasoning process, achieving consistent improvements across four tasks: mathematical reasoning, step-by-step error detection, self-correction, and long-chain reasoning.
MARGE: Improving Math Reasoning for LLMs with Guided Exploration: MARGE proposes a "hit-guided exploration" approach to enhance the mathematical reasoning capabilities of LLMs. By systematically exploring the intermediate reasoning states in self-generated solutions, it achieves thorough exploration and better credit assignment without requiring external annotations or additional value models, simultaneously improving single-attempt accuracy and exploration diversity.
No Soundness in the Real World: On the Challenges of the Verification of Deployed Neural Networks: This paper demonstrates that all current state-of-the-art neural network verifiers provide only "theoretical soundness" (bounding exact-precision outputs) rather than "practical soundness" (bounding floating-point outputs in deployment environments), and shows empirically that every tested verifier can be deceived by purpose-built, environment-sensitive adversarial backdoor networks.
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL: This paper introduces the Long CoT Collection—a 100K long-chain reasoning dataset annotated by short CoT LLMs (e.g., GPT-4o). By extracting reasoning flows from o1 as indirect guidance, short CoT models are enabled to generate high-quality long reasoning chains. This effectively mitigates the cold-start problem of open-source reasoning models during the reinforcement learning phase, yielding a 2 to 3-fold performance improvement in RLVR for the initialized models.
PCoT: Persuasion-Augmented Chain of Thought for Detecting Fake News and Social Media Disinformation: This paper proposes PCoT (Persuasion-Augmented Chain of Thought), a two-stage reasoning framework: the LLM first identifies persuasive strategies in the text, and the resulting persuasion analysis is then injected into the disinformation detection reasoning. In zero-shot settings, it achieves an average F1 improvement of approximately 15% across 5 LLMs and 5 datasets.
PENCIL: Long Thoughts with Short Memory: This paper proposes PENCIL (PENCIL ENables Context-efficient Inference and Learning), which introduces a reduction rule, inspired by the function call stack, into autoregressive generation to recursively clear completed intermediate reasoning steps. This enables LLMs to solve computationally hard problems that would otherwise require exponential context using only polynomial context length.
ProofCompass: Enhancing Specialized Provers with LLM Guidance: ProofCompass proposes a training-free hybrid approach that uses general LLMs to provide natural language proof strategies and intermediate lemma selections for specialized theorem provers (such as DeepSeek-Prover-v1.5-RL). It outperforms the baseline on miniF2F (54.9% → 55.3%) with 25 times fewer attempts.
Putnam-AXIOM: A Functional & Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs: This work introduces Putnam-AXIOM, a benchmark comprising 522 university-level Putnam competition math problems and 100 programmatic functional variants, which reveals memorization reliance in LLM mathematical reasoning and introduces Teacher-Forced Accuracy (TFA) as an evaluation metric for reasoning quality beyond final answers.
Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning: This work systematically analyzes the "snowball error" phenomenon in LLM reasoning from an information-theoretic perspective, establishes a theoretical link between snowball errors and the probability of correct reasoning, and demonstrates that external slow-thinking methods (e.g., BoN, MCTS) inherently mitigate error accumulation by scaling search width. It shows, both theoretically and experimentally, that the effectiveness of these methods depends primarily on the total inference budget and the reliability of the reward function rather than on the search framework itself.
Self-Consistency Preference Optimization: This paper brings the concept of self-consistency from inference into the training phase: it constructs preference pairs through a voting mechanism and trains iteratively with a weighted DPO loss. This significantly improves the mathematical and logical reasoning capabilities of LLMs without requiring gold labels.
Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration: This paper introduces Soft Reasoning, which injects Gaussian perturbations into the embedding space of the first generated token and utilizes Bayesian optimization to search for the optimal perturbation vector. This guides LLMs to explore better solution spaces during inference in a black-box manner without requiring access to model parameters or external verifiers. It outperforms baselines like temperature scaling and Best-of-N on mathematical reasoning tasks with extremely low computational overhead.
Towards Better Chain-of-Thought: A Reflection on Effectiveness and Faithfulness: This paper systematically analyzes the factors influencing the performance of CoT from the two dimensions of effectiveness and faithfulness. It finds that problem difficulty, information gain, and information flow are key factors affecting CoT effectiveness. The root cause of unfaithful CoT is that the model directly recalls the correct information from the question while bypassing the CoT during answer prediction. Based on this, the QUIRE method is proposed to improve both CoT effectiveness and faithfulness.
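
To make the voting step in the Self-Consistency Preference Optimization entry above more concrete, here is a minimal sketch of how a preference pair might be built from sampled responses. It is an illustration under stated assumptions, not the paper's implementation: the function name `build_preference_pair`, the choice of the most-voted answer as "chosen" and the least-voted answer as "rejected", and the vote-margin weight are all assumptions made for this example.

```python
from collections import Counter


def build_preference_pair(samples):
    """Build one (chosen, rejected, weight) triple from sampled responses.

    `samples` is a list of (response_text, final_answer) tuples generated
    for a single prompt. No gold label is used anywhere.
    Returns None when the votes give no usable preference signal.
    """
    # Count how many sampled responses arrive at each final answer.
    votes = Counter(answer for _, answer in samples)
    ranked = votes.most_common()  # sorted from most to least votes
    top_answer, top_votes = ranked[0]
    low_answer, low_votes = ranked[-1]

    # If every answer has the same vote count (including the case of a
    # single unanimous answer), there is no margin to learn from.
    if top_votes == low_votes:
        return None

    # Assumption: take the first response for each of the two answers.
    chosen = next(text for text, ans in samples if ans == top_answer)
    rejected = next(text for text, ans in samples if ans == low_answer)

    # Assumption: weight the pair by the normalized vote margin, so that
    # pairs with a clearer consensus count more in a weighted DPO loss.
    weight = (top_votes - low_votes) / len(samples)
    return chosen, rejected, weight


if __name__ == "__main__":
    samples = [("a1", "42"), ("a2", "42"), ("a3", "41"), ("a4", "42")]
    print(build_preference_pair(samples))  # ('a1', 'a3', 0.5)
```

In an iterative setup, the resulting triples would feed a weighted DPO update, after which the updated model would be sampled again to produce the next round of pairs.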