💬 LLM (Other)
📷 CVPR2025 · 15 paper notes
📌 Same area in other venues: 📷 CVPR2026 (3) · 🔬 ICLR2026 (56) · 💬 ACL2026 (62) · 🧪 ICML2026 (39) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (54)
- Building Vision Models upon Heat Conduction
This paper proposes vHeat, a vision backbone that models image patches as heat sources and propagates information globally by solving the physical heat conduction equation with DCT/IDCT transforms, at \(O(N^{1.5})\) complexity. It achieves 84.0% top-1 accuracy on ImageNet-1K with 3× higher throughput and 80% less GPU memory overhead.
- Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment
This paper proposes a new paradigm of Chat-based Person Retrieval (ChatPR), builds the first dialogue-image paired dataset ChatPedes, and designs the DiaNA framework to achieve fine-grained cross-modal alignment between dialogues and images via an adaptive attribute refiner, significantly outperforming traditional single-sentence text retrieval methods.
- ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices
This paper proposes ComRoPE, which generalizes RoPE into a rotary position embedding parameterized by trainable commuting angle matrices. It proves that pairwise commutativity of the angle matrices is a necessary and sufficient condition for RoPE to satisfy relative position dependency. On ImageNet-1K, ComRoPE outperforms the state-of-the-art LieRE method by 1.6% at the training resolution and 2.9% at higher resolutions.
- Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
Proposes Dora-VAE, which focuses on sharp geometric edge regions via Sharp Edge Sampling (SES) and handles uniform and salient sampled points separately using Dual Cross-Attention. It achieves superior 3D shape reconstruction quality with only 1,280 latent codes (8× fewer than XCube-VAE's 10,000+), while establishing a new evaluation benchmark, Dora-Bench.
- Exposure-slot: Exposure-centric Representations Learning with Slot-in-Slot Attention
This paper proposes the Exposure-slot framework, which extends the Slot Attention algorithm into a hierarchical slot-in-slot structure. Guided by learnable exposure prompts for feature clustering, it achieves exposure-centric region-aware representation learning, obtaining SOTA performance in under-/over-exposed image correction tasks.
- Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy
This paper proposes the IP-CIR method, which converts Composed Image Retrieval (CIR) into a standard image retrieval problem by using large language models to generate an "imagined target text description" as a proxy, achieving zero-shot SOTA on benchmarks such as CIRR and FashionIQ.
- Learning Textual Prompts for Open-World Semi-Supervised Learning
This paper proposes a new method for open-world semi-supervised learning (OWSSL) that enhances vision-language alignment via a global-and-local textual prompt learning strategy. It also designs a forward-and-backward strategy to reduce noise in vision-language matching for unlabeled samples, significantly outperforming prior SOTA methods on multiple fine-grained datasets.
- Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration
This paper proposes the MambaOFR framework to address the complex compound degradations unique to old films. It designs degradation-aware prompts to guide the Mamba model in dynamically adjusting restoration modes, incorporates a flow-guided masked deformable alignment module to prevent the propagation of structural defects, and introduces the first benchmark dataset for old film restoration containing both synthetic and real-world data.
- MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
MG-MotionLLM is a unified multi-granularity motion-language model. Built on a Motion VQ-VAE + T5 language model architecture and a multi-granularity synergy pre-training scheme comprising 28 tasks, it supports both coarse- and fine-grained motion comprehension and generation. It achieves state-of-the-art performance on classic tasks and also enables novel applications such as fine-grained motion editing.
- Rethinking Spiking Self-Attention Mechanism: Implementing a-XNOR Similarity Calculation in Spiking Transformers
This paper analyzes why the dot product fails as a similarity metric for spiking query-key pairs, tracing the failure to the large number of "non-spiking events." It proposes the a-XNOR similarity metric, designed specifically for spike sequences, which redefines the correlation of non-spiking pairs as a specific value \(a\). The metric significantly improves performance across various spiking Transformer architectures and datasets.
- Spiking Transformer: Introducing Accurate Addition-Only Spiking Self-Attention for Transformer
This paper proposes Accurate Addition-Only Spiking Self-Attention (A²OS²A), which significantly improves Spiking Transformer accuracy by leveraging a hybrid strategy that fuses binary, ReLU, and ternary spiking neurons while maintaining pure addition-only computation (no multiplication), achieving 78.66% on ImageNet-1K.
- Spiking Transformer with Spatial-Temporal Attention
This paper integrates spatial-temporal decoupled attention into the Spiking Transformer architecture. Combined with a spike-driven self-attention mechanism, the design bridges the performance gap with ANNs while preserving the energy efficiency of SNNs, achieving SOTA performance on multiple vision benchmarks.
- STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks
STAA-SNN integrates four modules into SNNs: Global Context self-attention (GC), Position Encoding (PE), Step Attention (SA), and Timestep Random Dropout (TSRD). It achieves state-of-the-art (SOTA) accuracy of 97.14% on CIFAR-10, 82.05% on CIFAR-100, and 70.40% on ImageNet.
- Test-Time Visual In-Context Tuning
This paper proposes Visual In-Context Tuning (VICT), which performs one-shot adaptation of visual in-context learning models (e.g., Painter) at test time by flipping the roles of task prompts and test samples and utilizing cycle consistency loss, significantly improving generalization under distribution shifts.
- The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation
This paper proposes HySCDG (Hybrid Semantic Change Detection Data Generation), a hybrid data generation pipeline that combines real very-high-resolution (VHR) remote sensing imagery with image inpainting techniques to generate large-scale training data for semantic change detection, achieving strong temporal and spatial generalization capabilities under a simple architectural design.
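
The vHeat entry in the list above describes heat conduction computed through DCT/IDCT transforms. The sketch below illustrates only that general idea and is not the vHeat implementation: it assumes a single-channel 2D map, a fixed scalar conductivity, insulated boundaries, and SciPy's orthonormal DCT-II. The function name `heat_conduction` and the parameters `k` and `t` are hypothetical.

```python
import numpy as np
from scipy.fft import dctn, idctn


def heat_conduction(u0, k=1.0, t=1.0):
    """Diffuse a 2D map for time t by damping its DCT coefficients.

    Assumption: fixed scalar conductivity k and insulated boundaries,
    for which the cosine basis of the DCT-II diagonalizes the problem.
    """
    h, w = u0.shape
    # Spatial frequencies of the DCT-II basis along each axis.
    wy = np.pi * np.arange(h) / h
    wx = np.pi * np.arange(w) / w
    # Each frequency component decays as exp(-k * (wx^2 + wy^2) * t).
    decay = np.exp(-k * (wy[:, None] ** 2 + wx[None, :] ** 2) * t)
    coeffs = dctn(u0, type=2, norm="ortho")
    return idctn(coeffs * decay, type=2, norm="ortho")


# A single hot patch spreads over the whole map after one transform pair.
u0 = np.zeros((8, 8))
u0[3, 4] = 1.0
out = heat_conduction(u0, k=0.5, t=2.0)
```

Because the zero-frequency coefficient is left undamped, the total heat `out.sum()` stays equal to `u0.sum()` while the peak value drops, which is the global smoothing behavior the entry refers to.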
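
The ComRoPE entry in the list above states that pairwise commutativity of the angle matrices is what makes the rotary embedding depend only on relative position. The sketch below checks that property numerically on a toy example. It is an illustration under assumptions, not the paper's code: the rotation is taken to be the matrix exponential of a position-weighted sum of skew-symmetric angle matrices, and the helper names `plane_generator`, `rotation`, and `relative_position_gap` are hypothetical.

```python
import numpy as np
from scipy.linalg import expm


def plane_generator(dim, i, j):
    """Skew-symmetric generator of a rotation in the (i, j) plane."""
    g = np.zeros((dim, dim))
    g[i, j], g[j, i] = -1.0, 1.0
    return g


def rotation(angle_matrices, position):
    """Assumed form: R(p) = exp(sum_i p_i * A_i)."""
    return expm(sum(p * a for p, a in zip(position, angle_matrices)))


def relative_position_gap(angle_matrices, x, y):
    """Largest entry of |R(x)^T R(y) - R(y - x)|; zero means relative-only."""
    lhs = rotation(angle_matrices, x).T @ rotation(angle_matrices, y)
    rhs = rotation(angle_matrices, np.asarray(y) - np.asarray(x))
    return float(np.abs(lhs - rhs).max())


g01, g12, g23 = (plane_generator(4, i, i + 1) for i in range(3))

# Rotations in disjoint planes commute, so these two angle matrices commute.
commuting = [0.7 * g01 + 0.2 * g23, -0.4 * g01 + 1.1 * g23]
# Rotations in overlapping planes do not commute.
non_commuting = [g01, g12]

x, y = (1.0, 0.5), (0.3, 2.0)
gap_commuting = relative_position_gap(commuting, x, y)
gap_non_commuting = relative_position_gap(non_commuting, x, y)
```

In this toy setting `gap_commuting` is at the level of floating-point error, while `gap_non_commuting` is clearly nonzero, so the attention score would depend on absolute positions in the second case.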