
💡 LLM Reasoning

📷 CVPR2025 · 7 paper notes

📌 Same area in other venues: 📷 CVPR2026 (16) · 🔬 ICLR2026 (241) · 💬 ACL2026 (82) · 🧪 ICML2026 (78) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (82)

🔥 Top topics: Reasoning ×6

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought: Argus proposes a grounded visual chain-of-thought (CoT) mechanism for explicit, target-oriented visual attention. The MLLM first predicts a question-relevant bounding box (RoI); the visual tokens of that region are then resampled or re-encoded and used as reasoning context. Argus achieves state-of-the-art results on both visual reasoning and object grounding among 7B/8B-scale MLLMs.
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation: AoTD uses an LLM agent to decompose complex video questions into subtasks, invokes expert vision models to execute them, and collects the intermediate results as a Chain-of-Thought (CoT). An LLM then filters the CoTs for quality, and the surviving CoTs are distilled into a Video-LLM, so that the end-to-end model produces both accurate answers and interpretable multi-step reasoning.
Interleaved-Modal Chain-of-Thought: This paper proposes Interleaved-Modal Chain-of-Thought (ICoT), which interleaves image-region crops as visual rationales within the reasoning steps. A parameter-free Attention-driven Selection (ADS) strategy selects key regions from the input image and inserts them into the generated sequence, yielding up to a 14% improvement over existing multimodal CoT methods on Chameleon and Qwen2-VL.
Learning-enabled Polynomial Lyapunov Function Synthesis via High-Accuracy Counterexample-Guided Framework: This paper proposes a learning-enabled polynomial Lyapunov function synthesis method which combines learning and verification. It uses data-driven machine learning to guide the selection of polynomial forms and iteratively optimizes them through a high-accuracy counterexample-guided framework, striking a balance between flexibility and mathematical rigor.
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval: This paper proposes OSrCIR, a training-free, one-stage method for zero-shot composed image retrieval. It uses a multimodal large language model to process the reference image and the modification text directly, and applies reflective Chain-of-Thought reasoning to infer the user's implicit intent, outperforming existing training-free methods by 1.80% to 6.44% across multiple benchmarks.
Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection: This paper proposes Chain-of-Thought Guided Style Evolution (CGSE). CGSE generates three-level progressive style descriptions (word \(\rightarrow\) phrase \(\rightarrow\) sentence) and combines them with feature disentanglement and class-specific prototype clustering, achieving significant improvements in domain-generalized object detection on five adverse-weather scenarios and the Real-to-Art benchmark.
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection: VideoEspresso constructs a large-scale video CoT reasoning dataset of over 200k samples, annotated with spatial bounding boxes and temporal grounding. It also proposes a hybrid framework, VideoQA-SC, in which a lightweight 1.5B model selects an average of 2.36 core frames and an 8B reasoning model then performs two-stage evidence extraction and answer generation. Using only 1.8% of the frames and 14.7% of the computation, it outperforms GPT-4o and all open-source LVLMs.
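
Two of the notes above (Argus and ICoT) describe the same general pattern: pick an image region that matters for the question, then feed that region back to the model as extra reasoning context. The following is only a toy sketch of that select-then-crop step in plain Python, not either paper's implementation. The function names (`select_top_regions`, `patch_to_box`, `crop`), the nested-list image, and the attention scores are all hypothetical and chosen purely for illustration; real systems operate on visual tokens inside the model.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def select_top_regions(attention: List[List[float]], k: int) -> List[Tuple[int, int]]:
    """Return (row, col) indices of the k highest-scoring patches, in reading order.

    `attention` is a hypothetical per-patch relevance grid (one score per patch).
    """
    scored = [
        (score, r, c)
        for r, row in enumerate(attention)
        for c, score in enumerate(row)
    ]
    top = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    return sorted((r, c) for _, r, c in top)


def patch_to_box(row: int, col: int, patch_size: int) -> Box:
    """Convert a patch index into a pixel-space bounding box."""
    return (col * patch_size, row * patch_size,
            (col + 1) * patch_size, (row + 1) * patch_size)


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the box out of a row-major image stored as nested lists."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 4x4 "image" split into 2x2 patches, with made-up attention scores.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
attention = [
    [0.10, 0.60],
    [0.05, 0.25],
]

# Select the single most-attended patch and crop it as a visual rationale.
regions = select_top_regions(attention, k=1)
crops = [crop(image, patch_to_box(r, c, patch_size=2)) for r, c in regions]
```

In this sketch `crops` holds the pixel content of the selected patch; in a grounded-CoT pipeline the analogous region would be re-encoded or inserted into the reasoning sequence rather than returned as raw pixels.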