
📷 CVPR2026 · 9 paper notes

Bi-CMPStereo: Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo

Bi-CMPStereo is a bidirectional cross-modal prompting framework that alternately designates the event and frame modalities as the target domain for stereo canonicalization and cross-domain embedding adaptation, and fuses cost volumes from both directions for robust event-frame asymmetric stereo matching.

Boosting Quantitative and Spatial Awareness for Zero-Shot Object Counting

The QICA framework addresses the lack of quantity awareness and spatial sensitivity in zero-shot object counting: a quantity-conditioned Synergistic Prompting Strategy (SPS) jointly adapts the vision-language encoders, while a Cost Aggregation Decoder (CAD) operating on similarity maps preserves zero-shot transferability, achieving zero-shot SOTA on FSC-147 (MAE 12.41) with strong cross-domain generalization.

Composing Concepts from Images and Videos via Concept-prompt Binding

Bind & Compose (BiCo) is a one-shot method that binds visual concepts to prompt tokens via hierarchical binders and achieves flexible image-video concept composition through token-level composition, comprehensively outperforming prior work in concept consistency, prompt fidelity, and motion quality.

CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

CoPS is a framework that dynamically generates prompts through two visual conditioning mechanisms — Explicit State Token Synthesis (ESTS) and Implicit Category Token Sampling (ICTS) — combined with Spatially-Aware Global-local Alignment (SAGA), achieving zero-shot anomaly detection SOTA across 13 industrial and medical datasets.

GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations

The paper proposes the GUIDE framework, which leverages in-context learning capabilities of LLMs to provide guided decision evolution for autonomous spacecraft operations, enabling progressive improvement of mission planning and fault diagnosis decisions through structured contextual information and feedback mechanisms without fine-tuning.

Perception Programs: Unlocking Visual Tool Reasoning in Language Models

Perception Programs (P2) is a training-free, model-agnostic method that converts raw visual tool outputs (depth, optical flow, correspondences, etc.) into compact language-native structured summaries, enabling MLLMs to directly "read" visual modalities rather than infer from dense pixels, achieving an average 19.66% improvement across 6 BLINK tasks.
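The core idea of converting a dense visual tool output into a compact, language-native summary can be illustrated with a toy sketch. This is not the paper's code; the function name `summarize_depth` and the object-mask input format are assumptions for illustration, showing how a depth map might be reduced to a short sentence an MLLM can "read" directly:

```python
import numpy as np

def summarize_depth(depth: np.ndarray, masks: dict[str, np.ndarray]) -> str:
    """Reduce a dense HxW depth map to a compact language summary.

    depth: per-pixel metric depth (metres); masks: object name -> boolean HxW mask.
    """
    # Mean depth per object, sorted nearest to farthest.
    order = sorted((float(depth[m].mean()), name) for name, m in masks.items())
    ranking = " < ".join(f"{name} ({d:.1f} m)" for d, name in order)
    return f"Depth order (near to far): {ranking}"

# Toy 2x2 depth map with two object masks.
depth = np.array([[1.0, 1.0], [4.0, 6.0]])
masks = {"chair": np.array([[True, True], [False, False]]),
         "table": np.array([[False, False], [True, True]])}
print(summarize_depth(depth, masks))
# → Depth order (near to far): chair (1.0 m) < table (5.0 m)
```

The structured sentence, rather than the raw pixel grid, is what gets placed in the model's context.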

PhysVid: Physics Aware Local Conditioning for Generative Video Models

PhysVid is a physics-aware local conditioning scheme that segments videos into temporal chunks, annotates each chunk with physics phenomenon descriptions via a VLM, and injects them through chunk-level cross-attention. At inference, "negative physics prompts" (counterfactual guidance) steer generation away from physics violations, improving physics commonsense scores by approximately 33% on VideoPhy.
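A "negative physics prompt" plugs into the standard classifier-free-guidance update, steering the denoiser away from the prediction conditioned on a counterfactual violation description. A minimal sketch of that arithmetic, assuming `eps_cond` and `eps_neg` are the noise predictions under the desired and the negative (violation) prompts:

```python
import numpy as np

def guided_noise(eps_cond: np.ndarray, eps_neg: np.ndarray, scale: float = 7.5) -> np.ndarray:
    """Classifier-free-guidance-style combination with a negative prompt.

    Moves the denoising direction away from the negative-prompt prediction
    (e.g. a counterfactual physics-violation description) toward the
    conditional one, amplified by the guidance scale.
    """
    return eps_neg + scale * (eps_cond - eps_neg)

# Toy 1-D example: the two predictions differ in the first component.
eps_cond = np.array([1.0, 0.0])
eps_neg = np.array([0.0, 0.0])
print(guided_noise(eps_cond, eps_neg, scale=2.0))  # → [2. 0.]
```

With `eps_neg` taken from the violation prompt instead of the usual unconditional branch, larger scales push generations further from the described violation.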

Sign Language Recognition in the Age of LLMs

The first systematic evaluation of modern VLMs on zero-shot isolated sign language recognition (ISLR), revealing that open-source VLMs fall far behind specialized classifiers while large commercial models (GPT-5) demonstrate surprising potential.

SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation

SketchDeco is a training-free line-art colorization method that uses a global-local two-stage strategy with region masks and color palettes as precise control signals, leveraging diffusion model inversion and self-attention injection in latent space for region-accurate coloring with harmonious global transitions, completing in 15–20 steps on consumer GPUs.