
⚡ LLM Efficiency

📷 CVPR2025 · 5 paper notes


Associative Transformer

This paper proposes the Associative Transformer (AiT), which integrates a learnable explicit memory module and a Hopfield network for token reconstruction into the Transformer architecture. AiT outperforms ViT on classification and relational reasoning while using fewer parameters.
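
To make the idea of Hopfield-based token reconstruction concrete, here is a minimal sketch of the standard modern Hopfield retrieval rule, in which a query is replaced by a softmax-weighted average of stored patterns. This is not taken from the AiT paper: the function names, the toy memory, and the inverse temperature `beta` are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_retrieve(memory, query, beta=8.0):
    """One modern-Hopfield update (illustrative, not AiT's exact module).

    Returns the average of the stored patterns, weighted by
    softmax(beta * <pattern, query>).
    """
    scores = [beta * sum(p * q for p, q in zip(pattern, query))
              for pattern in memory]
    weights = softmax(scores)
    return [sum(w * pattern[d] for w, pattern in zip(weights, memory))
            for d in range(len(query))]

# Hypothetical memory of three stored patterns and one noisy token.
memory = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
noisy_token = [0.9, 0.1, -0.1, 0.8]
reconstructed = hopfield_retrieve(memory, noisy_token)
```

With a large `beta` the output snaps to the closest stored pattern (here the first one); a smaller `beta` blends several patterns instead.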

Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks

This paper proposes a method that automatically extracts Mixture-of-Experts (MoE) variants from pre-trained ViTs. It first clusters the output activation patterns of the MLP layers, then extracts the corresponding subnetworks as experts, which avoids training an MoE from scratch. With only minimal fine-tuning, the extracted model recovers 98% of the original performance on ImageNet-1k while reducing FLOPs by 36% and model size by 32%.

LOCORE: Image Re-ranking with Long-Context Sequence Modeling

This paper proposes LoCoRe (Long-Context Re-ranker), the first list-wise image re-ranker built on local descriptors. It uses the Longformer long-context sequence model to process the local descriptors of the query image and the entire candidate list jointly, which lets it capture transitive relations among candidate images and significantly improves re-ranking performance.

Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks

This paper proposes a post-training method that extracts MoE variants from pre-trained ViTs. It discovers expert structures automatically by clustering the activation patterns of MLP hidden layers with HDBSCAN. On ImageNet-1k it reduces MACs by 36% and parameters by 32% while preserving 98% of the original accuracy without retraining.
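
The step from clustered activations to experts can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: cluster labels are taken as given (for example, from an HDBSCAN run, which marks noise points with `-1`), and the rule for picking each expert's hidden units (mean absolute activation above a hypothetical `threshold`) is invented for this example.

```python
from collections import defaultdict

def extract_experts(activations, cluster_labels, threshold=0.5):
    """Map each cluster id to the hidden units that are active for it.

    activations: one row of MLP hidden activations per input sample.
    cluster_labels: one integer label per sample; negative means noise.
    The selection rule (mean |activation| > threshold) is an assumption.
    """
    grouped = defaultdict(list)
    for row, label in zip(activations, cluster_labels):
        if label < 0:  # skip samples the clustering marked as noise
            continue
        grouped[label].append(row)

    experts = {}
    for label, rows in grouped.items():
        n_units = len(rows[0])
        mean_abs = [sum(abs(r[j]) for r in rows) / len(rows)
                    for j in range(n_units)]
        experts[label] = [j for j, m in enumerate(mean_abs) if m > threshold]
    return experts

# Toy data: two clear activation patterns plus one noise sample.
activations = [
    [2.0, 1.5, 0.0, 0.1],
    [1.8, 1.7, 0.1, 0.0],
    [0.0, 0.2, 2.2, 1.9],
    [0.1, 0.0, 2.0, 2.1],
    [0.3, 0.3, 0.3, 0.3],
]
cluster_labels = [0, 0, 1, 1, -1]
experts = extract_experts(activations, cluster_labels)
```

Each resulting index list identifies the rows and columns of the original MLP weights that would be copied into one expert, so every expert is a smaller subnetwork of the dense layer.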

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

This paper proposes Spatial-TTT, which uses Test-Time Training (TTT) to turn a subset of model parameters (fast weights) into a compact non-linear memory. Combined with a hybrid architecture and a spatial prediction mechanism, the model continuously accumulates and organizes 3D spatial evidence from unbounded video streams, achieving state-of-the-art results on video spatial understanding benchmarks.
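
The fast-weight idea can be illustrated with a deliberately simplified sketch: a weight matrix `W` is updated by one gradient step per incoming item, so the stream is stored in the weights. Everything here is an assumption for illustration. The memory is linear (the paper describes a non-linear one), and the key-to-value regression objective, learning rate, and toy stream are hypothetical rather than taken from Spatial-TTT.

```python
def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def loss(W, key, value):
    # Squared reconstruction error 0.5 * ||W key - value||^2.
    return 0.5 * sum((p - v) ** 2 for p, v in zip(matvec(W, key), value))

def ttt_step(W, key, value, lr=0.1):
    """One gradient step on loss(W, key, value) with respect to W.

    The gradient is the outer product (W key - value) key^T.
    """
    error = [p - v for p, v in zip(matvec(W, key), value)]
    return [[w - lr * e * kj for w, kj in zip(row, key)]
            for row, e in zip(W, error)]

# Hypothetical stream: two (key, value) pairs seen repeatedly over time.
pairs = [([1.0, 0.0], [0.5, -0.5]), ([0.0, 1.0], [1.0, 2.0])]
stream = pairs * 20

W = [[0.0, 0.0], [0.0, 0.0]]  # fast weights start empty
initial = sum(loss(W, k, v) for k, v in pairs)
for key, value in stream:
    W = ttt_step(W, key, value)  # update at inference time, item by item
final = sum(loss(W, k, v) for k, v in pairs)
```

The memory footprint stays fixed at the size of `W` however long the stream runs, which is the property that makes this style of update attractive for unbounded input.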