📚 Pretraining

📷 CVPR2026 · 5 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (79) · 💬 ACL2026 (12) · 🧪 ICML2026 (27) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (51) · 📹 ICCV2025 (9)

Exploring Visual Pretraining for Learning Language Intelligence

This paper proposes MAPLE: instead of extracting text from PDFs to feed into LLMs, it directly performs masked autoregressive pretraining on document page images. By allowing the LLM to learn language intelligence through "generating latent hypotheses for occluded regions," it achieves an average improvement of up to 40.2% over pure text pretraining across four mathematical reasoning benchmarks.

Linking Modality Isolation in Heterogeneous Collaborative Perception

CodeAlign addresses the "modality isolation" problem in heterogeneous collaborative perception, where different modalities never co-occur in the training data. It builds discrete code spaces from codebooks and performs cross-modal Feature-Code-Feature (FCF) translation, achieving SOTA perception performance with only 8% of HEAL's training parameters and a \(1024\times\) reduction in communication volume.
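The codebook idea can be illustrated with a toy sketch. It is not the paper's implementation: it assumes that the two modalities' codebooks are index-aligned, so a code index produced on one side can be decoded with the other side's codebook. The names `nearest_code`, `feature_code_feature`, `lidar_codebook`, and `camera_codebook` are hypothetical, as are all the numbers.

```python
import math


def nearest_code(feature, codebook):
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(feature, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def feature_code_feature(features, source_codebook, target_codebook):
    """Feature -> code index -> feature in the target modality's space.

    Assumption (not from the source): entry i of both codebooks represents
    the same concept, so the index alone carries the information.
    """
    indices = [nearest_code(f, source_codebook) for f in features]
    translated = [list(target_codebook[i]) for i in indices]
    return indices, translated


# Hypothetical toy codebooks with three 2-D entries each.
lidar_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
camera_codebook = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]

features = [[0.9, 1.2], [3.5, 0.1]]
indices, translated = feature_code_feature(features, lidar_codebook, camera_codebook)
print(indices)     # code indices that would be transmitted
print(translated)  # features decoded with the receiver's codebook
```

In this sketch only the integer indices would need to cross the network, which is the intuition behind a large reduction in communication volume. How far that intuition matches the reported \(1024\times\) figure depends on details the summary does not give.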

Reconstructing CLIP for Open-Vocabulary Dense Perception

DenseRC addresses the neglected problem of "how to construct high-quality dense features for CLIP." It reveals that the generalized semantics of the cls token actually derive from multi-layer value embeddings, whereas spatial aggregation tends to amplify semantic misalignment. By using multi-layer values as a foundation and employing a lightweight Head Selection Gating (HSG) for re-weighting solely across the head dimension, the authors construct dense representations aligned with global semantics. DenseRC sets new SOTAs on multiple open-vocabulary detection and segmentation benchmarks.
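Re-weighting "solely across the head dimension" can be sketched as follows. This is a minimal illustration under stated assumptions, not DenseRC's actual gating module: it assumes the gate is a softmax over one logit per attention head and that the same gate is applied at every spatial position. The names `softmax`, `gate_heads`, `values`, and `head_logits` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_heads(values, head_logits):
    """Scale each head's value embedding by a per-head gate.

    `values` has shape [tokens][heads][dim]. Tokens are never mixed with
    each other, so no spatial aggregation takes place in this step.
    """
    gates = softmax(head_logits)
    gated = [
        [[g * x for x in head] for g, head in zip(gates, token)]
        for token in values
    ]
    return gates, gated


# One token, two heads, two channels per head (toy numbers).
values = [[[2.0, 4.0], [6.0, 8.0]]]
gates, gated = gate_heads(values, head_logits=[0.0, 0.0])
print(gates)
print(gated)
```

Keeping the gate independent of token position is the design point the sketch tries to show: each location keeps its own value embedding and only the relative contribution of each head changes.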

Unlocking Pre-trained Weights: Parameter Inheritance for Zero-Shot Initialization

PITH uses a Graph HyperNetwork to dynamically generate "projection matrices" that map the internal weights of large pre-trained models directly onto target ViTs of arbitrary sizes for initialization. The initialized networks can be used immediately without training: ViT-Base reaches a zero-shot accuracy of 53.35% on ImageNet-1K, 6.54% higher than the previous SOTA (TAL).
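One way to picture a projection matrix that maps a large weight onto a smaller target is the two-sided product \(W_{\text{small}} = P_{\text{out}} W_{\text{large}} P_{\text{in}}^{\top}\). The sketch below assumes that form and fixes the projections by hand; in the paper they are generated by the Graph HyperNetwork, which is not modeled here. The helper names (`matmul`, `transpose`, `inherit_weights`) and all values are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def inherit_weights(w_large, p_out, p_in):
    """Project a large weight matrix onto a smaller shape.

    Assumed form (not confirmed by the source): W_small = P_out @ W_large @ P_in^T.
    """
    return matmul(matmul(p_out, w_large), transpose(p_in))


# A 4x4 "pre-trained" weight and hand-written 2x4 projections that
# average neighbouring rows and columns.
w_large = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
p_out = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
p_in = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

w_small = inherit_weights(w_large, p_out, p_in)
print(w_small)
```

Because the target shape is set entirely by the shapes of the two projections, the same large weight can be mapped to targets of different sizes by swapping in different projection matrices.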

Watch and Learn: Learning to Use Computers from Online Videos

The Watch & Learn (W&L) framework uses an Inverse Dynamics Model (IDM) to automatically transform human computer-operation videos from the internet into executable UI trajectory data. It generates 53K+ high-quality trajectories that significantly improve the performance of various Computer-Using Agents (CUAs) when used as In-Context Learning (ICL) examples or Supervised Fine-Tuning (SFT) data.