💡 LLM Reasoning

📷 CVPR2026 · 14 paper notes

Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D: A new paradigm called "Artistic Disparity Synthesis" (Art3D) is proposed, shifting the goal of 2D-to-3D conversion from geometric accuracy to artistic expression. A dual-path architecture decouples global depth style from local artistic effects, learning directorial intent from professional 3D film data.
E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought: This work constructs E-comIQ-ZH, the first multi-dimensional quality assessment framework for Chinese e-commerce posters, comprising an 18K expert-annotated dataset with CoT reasoning chains, a dedicated evaluation model E-comIQ-M (trained via SFT+GRPO), and a standardized benchmark E-comIQ-Bench.
EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence: EagleVision is a dual-stage framework. The macro-perception stage employs Semantic-Pose Fusion DPP (SPF-DPP) to jointly optimize semantic relevance and viewpoint diversity in SE(3) space for key-frame selection. The micro-verification stage lets the model actively query new viewpoint frames on the BEV plane for iterative spatial CoT reasoning (a hypothesize → observe → verify loop). The query strategy is trained purely via RL without human annotation, achieving open-source SOTA on VSI-Bench and SQA3D.
GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization: GRAZE is proposed as a training-free pipeline that leverages Grounding DINO to discover candidate interactions and employs SAM2 mask overlap as a pixel-level contact verifier, achieving 97.4% coverage and 77.5% contact onset frame localization accuracy within ±10 frames on 738 American football training videos.
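As a rough illustration of the mask-overlap idea in the GRAZE entry above, the sketch below treats contact onset as the first frame in which two binary masks share enough pixels, and scores a prediction as correct when it falls within a frame tolerance. The mask representation (nested lists of 0/1), the function names, and the `min_overlap` threshold are assumptions made for this example; they are not taken from the paper, and real SAM2 masks would arrive as arrays or tensors.

```python
def mask_overlap(mask_a, mask_b):
    """Count pixels that are set in both binary masks (same shape assumed)."""
    return sum(
        1
        for row_a, row_b in zip(mask_a, mask_b)
        for a, b in zip(row_a, row_b)
        if a and b
    )


def contact_onset(masks_a, masks_b, min_overlap=1):
    """Return the first frame index whose mask overlap reaches min_overlap.

    masks_a and masks_b are per-frame mask sequences for the two objects.
    Returns None if no frame qualifies. min_overlap is a hypothetical knob.
    """
    for frame_idx, (mask_a, mask_b) in enumerate(zip(masks_a, masks_b)):
        if mask_overlap(mask_a, mask_b) >= min_overlap:
            return frame_idx
    return None


def within_tolerance(predicted, ground_truth, tolerance=10):
    """True if the predicted onset lies within +/- tolerance frames."""
    return predicted is not None and abs(predicted - ground_truth) <= tolerance


# Toy example: a 1x4 "player" mask slides toward a static "target" mask.
player = [[[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 0, 1, 0]]]
target = [[[0, 0, 1, 1]], [[0, 0, 1, 1]], [[0, 0, 1, 1]]]
onset = contact_onset(player, target)
```

Here the masks first overlap in the third frame, so `onset` is index 2, and `within_tolerance(onset, 5)` would count as a hit under a ±10 frame window.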
Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing: The paper introduces FaceCoT, the first CoT-VQA dataset for face anti-spoofing (FAS) with 1.08 million samples covering 14 attack types, and proposes a two-stage progressive learning strategy CEPL, achieving an average AUC improvement of 4.06% and HTER reduction of 5.00% across 11 FAS benchmarks.
Latent Chain-of-Thought World Modeling for End-to-End Autonomous Driving: LCDrive proposes a Latent Chain-of-Thought (Latent CoT) framework that replaces natural language CoT with action proposal tokens and world model prediction tokens for reasoning, achieving lower latency and superior trajectory quality in end-to-end autonomous driving via cold-start + RL post-training.
Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought: This work identifies that existing LVLMs effectively ignore intermediate rationale content during CoT reasoning, and proposes RED (Rationale-Enhanced Decoding)—multiplying the image-conditioned and rationale-conditioned next-token distributions at the logit level. This approach is theoretically equivalent to the optimal solution of KL-constrained reward maximization, and significantly improves multimodal reasoning accuracy without any training.
Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought: This work identifies that existing LVLMs neglect the generated rationale content during multimodal CoT reasoning (image tokens dominate attention), and proposes Rationale-Enhanced Decoding (RED)—reformulating CoT as a KL-constrained rationale-conditioned log-likelihood reward maximization problem. The closed-form optimal solution multiplies the image-conditioned distribution \(p(y|x,q)\) by the rationale-conditioned distribution \(p(y|r,q)^\lambda\), significantly improving reasoning performance across multiple benchmarks without any training.
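The combination rule described in the RED entries above, \(p(y|x,q) \cdot p(y|r,q)^\lambda\) followed by renormalization, amounts to adding log-probabilities with a weight \(\lambda\) on the rationale term. The following is a minimal sketch of that single decoding step, assuming the two next-token logit vectors have already been computed by separate forward passes; the function names and the toy vocabulary of three tokens are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def red_combine(image_logits, rationale_logits, lam=1.0):
    """Return the distribution proportional to p_img * p_rat**lam.

    image_logits: next-token logits conditioned on the image and question.
    rationale_logits: next-token logits conditioned on the rationale and question.
    lam: weight on the rationale-conditioned term (lambda in the text).
    """
    log_p_img = log_softmax(image_logits)
    log_p_rat = log_softmax(rationale_logits)
    combined = [a + lam * b for a, b in zip(log_p_img, log_p_rat)]
    # Renormalize so the product of distributions sums to one again.
    return [math.exp(v) for v in log_softmax(combined)]


# Toy example: the image branch prefers token 0, the rationale branch token 1.
image_logits = [2.0, 1.0, 0.0]
rationale_logits = [0.0, 3.0, 0.0]
probs = red_combine(image_logits, rationale_logits, lam=1.0)
best_token = max(range(len(probs)), key=probs.__getitem__)
```

With `lam=0` the rationale term drops out and the result equals the image-conditioned distribution; larger values let the rationale shift the argmax, as in the toy example where token 1 wins.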
Reinforcing Structured Chain-of-Thought for Video Understanding: This paper proposes SDRL (Summary-Driven Reinforcement Learning), a single-stage RL framework that requires no SFT. By introducing a structured CoT (Summarize→Think→Answer) and two self-supervised mechanisms (CVK and DVR), SDRL enhances temporal reasoning in video understanding and achieves state-of-the-art results on 7 VideoQA benchmarks.
Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering: This work constructs Step-CoT, the first structured multi-step CoT medical reasoning dataset aligned with clinical diagnostic workflows (10K+ cases / 70K QA pairs), and proposes a teacher-student framework based on graph attention networks for stepwise reasoning supervision, improving both accuracy and interpretability in Med-VQA.
Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models: This paper systematically analyzes the causes of hallucinations in multimodal CoT models, identifies "divergent thinking" (associative reasoning) as the core trigger, and proposes a training-free detection and decoding intervention strategy based on visual entropy. The method reduces \(\text{CHAIR}_S\) by over 30% on Object HalBench while maintaining or improving general reasoning capability.
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models: This paper proposes the Hallucination-as-Cue analytical framework, systematically investigating the true mechanisms underlying RL post-training of multimodal reasoning models via three modality-specific corruption strategies (blank image, random image, text removal). The study finds that GRPO training with 100% corrupted visual inputs still yields significant improvements in reasoning performance, challenging the prevailing assumption that RL training effectively leverages visual information.
VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models: This paper proposes VisRef, a training-free visual refocusing framework that, during inference in multimodal large reasoning models (MLRMs), adaptively selects a semantically relevant and visually diverse subset of tokens at each reasoning step via Determinantal Point Processes (DPP) and reinjects them into the context. An entropy-based stopping criterion prevents overthinking. Under a fixed compute budget, VisRef improves visual reasoning accuracy by up to 6.4%.
VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models: This paper proposes VisRef, a training-free visual refocusing framework that dynamically selects and re-injects semantically relevant and diverse visual tokens—chosen via a Determinantal Point Process (DPP)—into the reasoning context of Multimodal Large Reasoning Models (MLRMs) at each inference step, addressing the progressive decay of visual attention during long-chain reasoning. VisRef achieves improvements of up to 6.4% on benchmarks such as MathVista.
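To make the DPP-based selection mentioned in the VisRef entries more concrete, here is a small sketch of greedy subset selection under a quality-diversity kernel \(L_{ij} = q_i \, s_{ij} \, q_j\), where \(q_i\) is a relevance score and \(s_{ij}\) is cosine similarity between token features. This is a generic greedy approximation to DPP MAP inference, not the paper's implementation; the kernel form, the function names, and the toy features are assumptions, and feature vectors are assumed to be nonzero.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def greedy_dpp_select(features, relevance, k):
    """Greedily pick up to k indices that maximize det of the kernel submatrix."""
    n = len(features)
    kernel = [
        [relevance[i] * cosine(features[i], features[j]) * relevance[j] for j in range(n)]
        for i in range(n)
    ]
    selected = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[kernel[r][c] for c in idx] for r in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining candidate is redundant
            break
        selected.append(best)
    return selected


# Toy example: tokens 0 and 1 are near-duplicates; token 2 is distinct.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
relevance = [1.0, 0.9, 0.8]
chosen = greedy_dpp_select(features, relevance, k=2)
```

In the toy example, token 1 is more relevant than token 2 but almost identical to the already selected token 0, so the determinant criterion prefers token 2. This is the relevance-versus-diversity trade-off the entries describe.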