💬 LLM (Other)

📷 CVPR2025 · 15 paper notes

Building Vision Models upon Heat Conduction

This paper proposes vHeat, a vision backbone that models image patches as heat sources and solves the physical heat conduction equation via DCT/IDCT transforms to propagate information globally with \(O(N^{1.5})\) complexity. It achieves 84.0% top-1 accuracy on ImageNet-1K with 3x higher throughput and 80% less GPU memory overhead.
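The DCT-based propagation described above can be illustrated with a minimal sketch. It assumes a 1D temperature profile with insulated boundaries, a case in which the heat equation reduces to an exponential decay of each DCT coefficient. The names `dct`, `idct`, `conduct_heat`, and the parameters `k` and `t` are illustrative and are not taken from the paper; for an image, the same transform would be applied along rows and columns.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for f in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * f / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if f == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[f] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * f / n)
            for f in range(1, n)
        )
        out.append(s)
    return out

def conduct_heat(u0, k=1.0, t=1.0):
    """Diffuse profile u0 for time t with diffusivity k (both hypothetical)."""
    n = len(u0)
    coeffs = dct(u0)
    # Each frequency f decays by exp(-k * omega_f^2 * t), with omega_f = pi * f / n.
    decayed = [
        c * math.exp(-k * (math.pi * f / n) ** 2 * t)
        for f, c in enumerate(coeffs)
    ]
    return idct(decayed)

# A single hot "patch" spreads its heat to every other position.
u0 = [0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0]
u1 = conduct_heat(u0, k=1.0, t=2.0)
```

Because the zero-frequency coefficient is left untouched, total heat is conserved while the peak flattens, and every output position depends on every input position.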

Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment

This paper proposes a new paradigm of Chat-based Person Retrieval (ChatPR), builds the first dialogue-image paired dataset ChatPedes, and designs the DiaNA framework to achieve fine-grained cross-modal alignment between dialogues and images via an adaptive attribute refiner, significantly outperforming traditional single-sentence text retrieval methods.

ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices

This paper proposes ComRoPE, which generalizes RoPE into a rotary position embedding parameterized by trainable commuting angle matrices. It proves that pairwise commutativity of the angle matrices is a necessary and sufficient condition for RoPE to satisfy relative position dependency, and it outperforms the state-of-the-art LieRE method by 1.6% at the training resolution and 2.9% at higher resolutions on ImageNet-1K.
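Relative position dependency can be checked numerically in the simplest setting. The sketch below uses a single standard 2D RoPE block with a fixed angle, not ComRoPE's trainable angle matrices; 2D rotations always commute, so the commutativity condition holds trivially here. The names `rotate`, `score`, and `theta` are illustrative.

```python
import math

def rotate(vec, position, theta=0.5):
    """Rotate a 2D vector by the angle position * theta (one RoPE block)."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n."""
    qr = rotate(q, m)
    kr = rotate(k, n)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 2.0), (0.5, -1.0)

# Both pairs have the same offset n - m = 4, so the logits agree.
same_offset_a = score(q, k, m=3, n=7)
same_offset_b = score(q, k, m=10, n=14)

# A different offset (n - m = 5) gives a different logit.
other_offset = score(q, k, m=3, n=8)
```

Shifting both positions by the same amount leaves the logit unchanged, which is the property the commutativity condition is meant to preserve when the rotations are generalized.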

Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders

This paper proposes Dora-VAE, which focuses on sharp geometric edge regions via Sharp Edge Sampling (SES) and processes uniformly sampled and salient points separately using Dual Cross-Attention. It achieves superior 3D shape reconstruction quality with only 1,280 latent codes (about 8× fewer than XCube-VAE's 10,000+) and establishes a new evaluation benchmark, Dora-Bench.

Exposure-slot: Exposure-centric Representations Learning with Slot-in-Slot Attention

This paper proposes the Exposure-slot framework, which extends the Slot Attention algorithm into a hierarchical slot-in-slot structure. Guided by learnable exposure prompts for feature clustering, it achieves exposure-centric region-aware representation learning, obtaining SOTA performance in under-/over-exposed image correction tasks.

Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy

This paper proposes IP-CIR, which recasts Composed Image Retrieval (CIR) as a standard image retrieval problem by using large language models to generate an "imagined target text description" as a proxy, achieving zero-shot SOTA on benchmarks such as CIRR and FashionIQ.

Learning Textual Prompts for Open-World Semi-Supervised Learning

This paper proposes a new method for open-world semi-supervised learning (OWSSL) that enhances vision-language alignment via a global-and-local textual prompt learning strategy. It also designs a forward-and-backward strategy to reduce noise in vision-language matching for unlabeled samples, significantly outperforming prior SOTA methods on multiple fine-grained datasets.

Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration

This paper proposes the MambaOFR framework to address the complex compound degradations unique to old films. It designs degradation-aware prompts to guide the Mamba model in dynamically adjusting restoration modes, incorporates a flow-guided masked deformable alignment module to prevent the propagation of structural defects, and introduces the first benchmark dataset for old film restoration containing both synthetic and real-world data.

MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities

This paper proposes MG-MotionLLM, a unified multi-granularity motion-language model. Built on a Motion VQ-VAE plus T5 language model architecture and trained with a multi-granularity synergy pre-training scheme comprising 28 tasks, it supports both coarse- and fine-grained motion comprehension and generation. It achieves state-of-the-art performance on classic tasks and enables novel applications such as fine-grained motion editing.

Rethinking Spiking Self-Attention Mechanism: Implementing a-XNOR Similarity Calculation in Spiking Transformers

This paper analyzes why the dot product fails as a similarity metric for spiking query-key pairs, tracing the failure to the large number of "non-spiking events." It proposes the a-XNOR similarity metric, designed specifically for spike sequences, which redefines the correlation of non-spiking pairs as a specific value \(a\). This approach significantly improves performance across various spiking Transformer architectures and datasets.
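The difference between the two metrics can be sketched on toy binary spike trains. The scoring rule below is an assumption inferred from this summary only: a pair of spikes (1, 1) scores 1, a non-spiking pair (0, 0) scores \(a\), and mismatched pairs score 0. The paper's exact formulation may differ, and `a = 0.5` is an arbitrary illustrative value.

```python
def dot_similarity(q, k):
    """Dot product: only positions where both trains spike contribute."""
    return sum(qi * ki for qi, ki in zip(q, k))

def a_xnor_similarity(q, k, a=0.5):
    """Assumed a-XNOR rule: (1, 1) -> 1, (0, 0) -> a, mismatch -> 0."""
    total = 0.0
    for qi, ki in zip(q, k):
        if qi == 1 and ki == 1:
            total += 1.0
        elif qi == 0 and ki == 0:
            total += a
    return total

q = [1, 0, 0, 0, 1, 0]
k_identical = [1, 0, 0, 0, 1, 0]   # same spike pattern as q
k_all_spikes = [1, 1, 1, 1, 1, 1]  # fires at every step

dot_identical = dot_similarity(q, k_identical)
dot_all = dot_similarity(q, k_all_spikes)
axnor_identical = a_xnor_similarity(q, k_identical)
axnor_all = a_xnor_similarity(q, k_all_spikes)
```

In this toy case the dot product gives both keys the same score of 2, while the assumed a-XNOR rule ranks the identical key higher because the shared silent steps also count.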

Spiking Transformer: Introducing Accurate Addition-Only Spiking Self-Attention for Transformer

This paper proposes Accurate Addition-Only Spiking Self-Attention (A²OS²A), which significantly improves Spiking Transformer accuracy by leveraging a hybrid strategy that fuses binary, ReLU, and ternary spiking neurons while maintaining pure addition-only computation (no multiplication), achieving 78.66% on ImageNet-1K.

Spiking Transformer with Spatial-Temporal Attention

This paper integrates spatial-temporal attention into the Spiking Transformer architecture. By combining a spatially and temporally decoupled attention design with a spike-driven self-attention mechanism, the approach narrows the performance gap with ANNs while preserving the energy efficiency of SNNs, achieving SOTA performance on multiple vision benchmarks.

STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks

STAA-SNN integrates four modules into SNNs: Global Context self-attention (GC), Position Encoding (PE), Step Attention (SA), and Timestep Random Dropout (TSRD). It achieves SOTA performance on CIFAR-10, CIFAR-100, and ImageNet with accuracies of 97.14%, 82.05%, and 70.40%, respectively.

Test-Time Visual In-Context Tuning

This paper proposes Visual In-Context Tuning (VICT), which performs one-shot adaptation of visual in-context learning models (e.g., Painter) at test time by flipping the roles of task prompts and test samples and utilizing cycle consistency loss, significantly improving generalization under distribution shifts.

The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation

This paper proposes HySCDG (Hybrid Semantic Change Detection Data Generation), a hybrid data generation pipeline that combines real very-high-resolution (VHR) remote sensing imagery with image inpainting techniques to generate large-scale training data for semantic change detection, achieving strong temporal and spatial generalization capabilities under a simple architectural design.