💡 LLM Reasoning
📷 CVPR2026 · 16 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (241) · 💬 ACL2026 (82) · 🧪 ICML2026 (78) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (81) · 📹 ICCV2025 (3)
🔥 Top topics: Reasoning ×14
- Agile Deliberation: Concept Deliberation for Subjective Visual Classification
For subjective concepts with fuzzy boundaries like "healthy food" or "clickbait," this work proposes Agile Deliberation, a human-in-the-loop framework. The system decomposes concepts into hierarchies of positive/negative sub-concepts, iteratively retrieves "semantic boundary samples" for user annotation and reflection, and automatically compiles feedback into VLM prompts. This allows the image classifier to align with users' evolving intentions. In 18 real-user experiments, it outperformed automatic decomposition baselines by 7.5% in F1 and manual deliberation by over 3%.
- APPO: Attention-guided Perception Policy Optimization for Video Reasoning
APPO identifies that "the bottleneck of video reasoning lies in perception rather than reasoning." It leverages the model's own attention on video frames to convert sparse outcome rewards into token-level dense rewards. By applying differential weighted learning to "intra-group perception tokens" that focus on the same key frames across different responses based on reward disparities, it consistently outperforms GRPO and DAPO on Qwen2.5-VL-3/7B by 0.5%–4%.
- Dynamic Important Example Mining for Reinforcement Finetuning
In each training step of RFT (e.g., GRPO or PPO), DIEM uses the inner product between each single-sample gradient and the total batch gradient to estimate, in real time, that sample's marginal contribution to the current policy improvement. It then solves a constrained optimization problem to reweight samples while maintaining the gradient magnitude. With nearly zero extra overhead (+1.3% time), it improves multimodal reasoning benchmarks by 1–6 points on average.
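The scoring-and-reweighting idea can be illustrated with a minimal NumPy sketch. The summary does not specify the constrained optimization problem, so a softmax over the alignment scores is used here as an assumed stand-in; the function name, the `temperature` parameter, and the toy gradients are all hypothetical.

```python
import numpy as np

def reweight_by_gradient_alignment(per_sample_grads, temperature=1.0):
    """Return (weights, reweighted_gradient) for per-sample gradients of shape (n, d)."""
    G = np.asarray(per_sample_grads, dtype=float)
    batch_grad = G.mean(axis=0)          # total (mean) batch gradient
    scores = G @ batch_grad              # inner product of each sample with the batch gradient
    # Assumption: softmax weights stand in for the paper's constrained optimization.
    z = (scores - scores.max()) / temperature
    weights = np.exp(z) / np.exp(z).sum()
    new_grad = weights @ G               # weighted combination of sample gradients
    # Rescale so the update keeps the original gradient magnitude.
    norm = np.linalg.norm(new_grad)
    if norm > 0:
        new_grad = new_grad * (np.linalg.norm(batch_grad) / norm)
    return weights, new_grad

# Toy per-sample gradients: two aligned samples and one that opposes the batch direction.
grads = np.array([[1.0, 0.0], [0.8, 0.2], [-0.5, 0.1]])
weights, new_grad = reweight_by_gradient_alignment(grads)
```

In this toy batch the sample that points against the batch gradient receives the smallest weight, while the norm of the reweighted gradient matches that of the original batch gradient.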
- E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought
This work constructs E-comIQ-ZH, the first multi-dimensional quality evaluation framework for Chinese e-commerce posters. It consists of an 18K expert-annotated dataset (including CoT reasoning chains), a dedicated evaluation model E-comIQ-M (trained via SFT+GRPO), and a standardized benchmark E-comIQ-Bench.
- EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
The proposed EagleVision is a dual-stage framework. In the macro-perception stage, it utilizes Semantic-Perspective Fusion DPP (SPF-DPP) to jointly optimize semantic relevance and perspective diversity in \(SE(3)\) space for keyframe selection. In the micro-verification stage, the model actively queries new perspective frames on a BEV plane to conduct iterative spatial CoT reasoning (hypothesis \(\rightarrow\) view \(\rightarrow\) verification loop). The query strategy is trained purely via RL without human annotation, achieving open-source SOTA on VSI-Bench and SQA3D.
- FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle
A VLM (Oracle) fine-tuned with GRPO and Chain-of-Thought (CoT) reasoning first infers a scalar wildfire risk score from satellite imagery and climate data. Then, FiLM is used to feed this score into a lightweight vision Encoder-Decoder to generate a high-resolution continuous risk raster. In a "US training, Europe testing" cross-continent setting, explicit linguistic reasoning significantly improves out-of-distribution (OOD) generalization, and the reasoning traces are interpretable and recoverable by wildfire experts.
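As a rough illustration of how FiLM can inject a scalar score into a feature map, the sketch below applies a per-channel scale and shift derived from the score. The linear conditioning parameters (`w_gamma`, `b_gamma`, `w_beta`, `b_beta`) and the tensor shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def film(features, score, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation of a (C, H, W) feature map by a scalar score."""
    gamma = w_gamma * score + b_gamma    # (C,) per-channel scale
    beta = w_beta * score + b_beta       # (C,) per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
features = rng.normal(size=(C, H, W))
# Hypothetical conditioning parameters; in a real model these would be learned.
w_gamma, b_gamma = rng.normal(size=C), np.ones(C)
w_beta, b_beta = rng.normal(size=C), np.zeros(C)

modulated = film(features, 0.7, w_gamma, b_gamma, w_beta, b_beta)
```

With these initial values, a score of zero yields a scale of one and a shift of zero, so the features pass through unchanged.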
- Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
Hilbert-Geo is the first unified formal language framework (including a predicate library and a theorem library) for solid geometry. It utilizes a "Parse2Reason" approach: first, a Multimodal Large Language Model (MLLM) translates text and 3D diagrams into a formal Condition Description Language (CDL); then, a specialized symbolic reasoning engine performs rigorous theorem searching. This method improves MLLM accuracy in solid geometry from approximately 50% to 77.3%, approaching human performance.
- Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop
This work decomposes the iterative human cognitive process of "understanding-solving-reunderstanding" into a cyclic interaction between an Understanding Module (UM) and a Solving Module (SM). Supplemented by representation isomorphism constraints and an adaptive halting mechanism, a small model with only 7M parameters achieves 47.2% accuracy on ARC-AGI-1, surpassing TRM and several general-purpose large language models.
- Latent Chain-of-Thought World Modeling for End-to-End Autonomous Driving
LCDrive proposes the Latent Chain-of-Thought (Latent CoT) framework, which replaces natural-language CoT with action proposal tokens and world-model prediction tokens as the reasoning medium. Through cold-start and RL post-training, it achieves lower latency and superior trajectory quality for end-to-end autonomous driving.
- Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
This paper discovers that existing LVLMs actually ignore the content of intermediate rationales during CoT reasoning. It proposes RED (Rationale-Enhanced Decoding), which multiplies next-token distributions conditioned on images and rationales at the logit level. Theoretically equivalent to the optimal solution for KL-constrained reward maximization, RED significantly improves multimodal reasoning accuracy without requiring training.
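Multiplying two next-token distributions is the same as adding their log-probabilities and renormalizing, which the following minimal sketch shows. The exponent `lam` on the rationale-conditioned distribution and the toy logits are assumptions for illustration; the summary does not state how the two terms are weighted.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D vector."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def combine_next_token(logits_image, logits_rationale, lam=1.0):
    """Product of two next-token distributions, computed in log space."""
    # Assumption: lam weights the rationale-conditioned distribution.
    combined = log_softmax(logits_image) + lam * log_softmax(logits_rationale)
    return np.exp(log_softmax(combined))   # renormalize to a valid distribution

# Toy logits over a three-token vocabulary.
logits_image = [2.0, 1.0, 0.1]
logits_rationale = [0.5, 2.5, 0.1]
probs = combine_next_token(logits_image, logits_rationale)
```

Here the image-conditioned logits alone favor token 0, but the product favors token 1 because the rationale-conditioned distribution strongly supports it.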
- Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
This paper utilizes a latent space learned via a VAE to inject a "reasoning palette" into (V)LMs. Each sampled latent variable is decoded into a learnable prefix prepended to the prompt, enabling the model to select a specific reasoning style before generating the first token. This approach upgrades "token-level random sampling" in RL to "strategy-level structured exploration," consistently outperforming standard GRPO/RLOO on multiple mathematical reasoning benchmarks.
- ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
ReLaX abandons the practice of forcibly increasing token-level entropy to counteract entropy collapse in RLVR. Instead, it utilizes the Koopman operator to linearize the latent state dynamics of large reasoning models and introduces "Dynamic Spectral Divergence (DSD)" to quantify internal computational flexibility. By integrating DSD into the GRPO objective, it achieves new SOTA performance on 7 multimodal and 6 text-based reasoning benchmarks.
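One common finite-dimensional approximation of a Koopman operator is a least-squares fit of a linear map between consecutive state snapshots (dynamic mode decomposition). The sketch below shows only that linearization step on synthetic states; DSD is not defined in this summary and is not implemented, and the function name and toy dynamics are hypothetical.

```python
import numpy as np

def fit_linear_dynamics(states):
    """Least-squares fit of K such that states[t + 1] is approximately K @ states[t]."""
    H = np.asarray(states, dtype=float)
    X, Y = H[:-1], H[1:]                       # consecutive snapshot pairs
    # Solve X @ K.T = Y in the least-squares sense.
    K_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return K_T.T

# Synthetic "latent states" generated by a known linear map A (illustrative only).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
states = [np.array([1.0, 1.0])]
for _ in range(10):
    states.append(A @ states[-1])

K = fit_linear_dynamics(states)
eigenvalues = np.linalg.eigvals(K)             # spectrum of the fitted operator
```

Because the toy trajectory is exactly linear, the fitted operator recovers `A`; on real hidden states the fit would only be approximate.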
- Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
The authors systematically compare three "think with image" supervision formats—Language CoT, Grounding CoT, and Visual CoT—using a controlled maze navigation task. They find that longer or more elaborate Visual CoTs only accelerate convergence without raising the final performance ceiling. Conversely, a minimalist CoT preserving only essential grounding information (a single coordinate path) achieves the best generalization. The paper proposes the "short is long" effect and provides a practical guide for constructing generalizable visual reasoning SFT data.
- Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
This work proposes VISTA-Gym, a scalable training environment for visual tool agents (comprising 7 task categories, 13 datasets, and 26 standardized visual tools). Within this environment, the authors train VISTA-R1 using a "Behavioral Cloning (BC) warm-up + multi-round online GRPO" paradigm. This enables 8B-scale VLMs to dynamically select, invoke, and coordinate visual tools during reasoning, outperforming SOTA models of similar scale by 9.51%–18.72% across 11 reasoning-intensive VQA benchmarks.
- Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
TaYS transforms the video reasoning of Large Vision-Language Models (LVLMs) from a "batch" paradigm (look-at-all-then-think) to a "streaming" paradigm (think-while-looking). By utilizing a streaming attention mask, decoupled positional encoding, and a parallel dual KV cache, reasoning proceeds incrementally in synchronization with video frames. On VideoEspresso, the Time-to-First-Token (TTFT) is reduced from 10.6s to near zero, reasoning-event deviation is lowered by 55%, and reasoning accuracy is improved by 2.9%.
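A streaming attention mask can be pictured as follows: each reasoning token may attend only to frame tokens that have already arrived when it is generated. This is a minimal sketch of that idea under assumed inputs (an arrival step per frame token and an emission step per reasoning token); it is not the paper's exact mask construction.

```python
import numpy as np

def streaming_mask(frame_steps, reasoning_steps):
    """Boolean mask of shape (n_reasoning, n_frames); True means 'may attend'."""
    frame_steps = np.asarray(frame_steps)
    reasoning_steps = np.asarray(reasoning_steps)
    # A reasoning token emitted at step t sees frame tokens that arrived at steps <= t.
    return frame_steps[None, :] <= reasoning_steps[:, None]

# Five frame tokens arriving at steps 0, 0, 1, 1, 2; three reasoning tokens at steps 0, 1, 2.
mask = streaming_mask(frame_steps=[0, 0, 1, 1, 2], reasoning_steps=[0, 1, 2])
```

The first reasoning token sees only the two frame tokens from step 0, while the last one sees all five.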
- VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
This work proposes VisRef, a training-free visual refocusing framework. During inference with Multi-modal Large Reasoning Models (MLRMs), VisRef uses Determinantal Point Processes (DPP) to adaptively select and re-inject a subset of visual tokens that are both semantically relevant to the current reasoning state and visually diverse. Combined with an entropy-based stopping criterion to prevent over-reasoning, VisRef improves visual reasoning accuracy by up to 6.4% under a fixed computational budget.
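To make the relevance-plus-diversity trade-off concrete, here is a minimal sketch of greedy DPP selection with a quality-times-similarity kernel. The kernel construction, the greedy log-determinant search, and the toy features are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def greedy_dpp(features, relevance, k):
    """Greedily pick up to k items that are relevant and mutually diverse."""
    X = np.asarray(features, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm rows
    q = np.asarray(relevance, dtype=float)
    # Assumed kernel: L[i, j] = q[i] * cosine_similarity(i, j) * q[j].
    L = (q[:, None] * (X @ X.T)) * q[None, :]
    L = L + 1e-6 * np.eye(len(q))                      # keep submatrices non-singular
    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(len(q)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected

# Items 0 and 1 are near-duplicates; item 2 is less relevant but points elsewhere.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
relevance = [1.0, 0.9, 0.8]
chosen = greedy_dpp(features, relevance, k=2)
```

In this toy case the selection keeps item 0 and then prefers item 2 over the more relevant but redundant item 1.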