

📷 CVPR2026 · 10 paper notes

Defending Unauthorized Model Merging via Dual-Stage Weight Protection

This paper proposes MergeGuard, a proactive dual-stage weight protection framework: Stage 1 disperses task-critical weights via L2 regularization, and Stage 2 injects structured perturbations to disrupt merging compatibility. The protected model loses less than 1.5% of its own performance, while the accuracy of models merged from it drops by up to 90%.
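
The note doesn't specify what the structured perturbations look like, so here is a minimal NumPy sketch of one classic function-preserving trick that illustrates the general idea: reparameterize two consecutive linear layers with an invertible matrix Q, so the protected model behaves identically while naive weight averaging breaks. This is an illustration, not necessarily MergeGuard's actual Stage 2; all names are mine.

```python
import numpy as np

def protect_pair(W1, W2, scale=3.0, seed=None):
    """Function-preserving structured perturbation of two consecutive
    linear layers: W2 @ W1 == (W2 @ inv(Q)) @ (Q @ W1), so the protected
    model's outputs are unchanged, yet each weight matrix is moved far
    from its original values, degrading naive weight averaging."""
    rng = np.random.default_rng(seed)
    d = W1.shape[0]  # hidden width shared by the two layers
    Q = np.eye(d) + scale * rng.standard_normal((d, d)) / np.sqrt(d)
    return Q @ W1, W2 @ np.linalg.inv(Q)

W1, W2 = np.random.randn(8, 4), np.random.randn(4, 8)  # maps 4 -> 8 -> 4
P1, P2 = protect_pair(W1, W2, seed=0)
print(np.allclose(W2 @ W1, P2 @ P1))  # True: behavior preserved
```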

Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation

This paper proposes the Evidential Transformation Network (ETN), a lightweight post-hoc module that learns sample-dependent affine transformations in logit space to convert pretrained classifiers or LLMs into evidential models, achieving reliable uncertainty estimation with minimal computational overhead.
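
As a concrete picture of the mechanism, here is a hedged NumPy sketch of a post-hoc evidential head: a small network predicts a per-sample affine map (scale, shift) of the frozen model's logits, and the transformed logits are read as Dirichlet evidence. The exact parameterization (softplus evidence, \(\alpha = e + 1\), vacuity uncertainty \(K/\sum\alpha\)) follows standard evidential deep learning conventions and is my assumption, not necessarily ETN's.

```python
import numpy as np

def etn_head(logits, feats, Wa, ba, Wb, bb):
    """Hypothetical post-hoc evidential head: predict a sample-dependent
    affine transformation of the frozen classifier's logits, then treat
    the softplus of the transformed logits as Dirichlet evidence."""
    a = np.exp(feats @ Wa + ba)                    # positive per-sample scale
    b = feats @ Wb + bb                            # per-sample shift
    evidence = np.logaddexp(0.0, a * logits + b)   # stable softplus >= 0
    alpha = evidence + 1.0                         # Dirichlet concentration
    K = logits.shape[-1]
    probs = alpha / alpha.sum(-1, keepdims=True)   # expected class probabilities
    uncertainty = K / alpha.sum(-1)                # vacuity-style uncertainty
    return probs, uncertainty

rng = np.random.default_rng(0)
feats, logits = rng.standard_normal((2, 16)), rng.standard_normal((2, 5))
Wa = 0.01 * rng.standard_normal((16, 5)); ba = np.zeros(5)
Wb = 0.01 * rng.standard_normal((16, 5)); bb = np.zeros(5)
print(etn_head(logits, feats, Wa, ba, Wb, bb))
```

Only (Wa, ba, Wb, bb) would be trained post hoc; the pretrained model stays frozen, which is where the minimal overhead comes from.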

FlowMotion: Training-Free Flow Guidance for Video Motion Transfer

FlowMotion is a training-free video motion transfer framework. It constructs motion guidance signals directly from the latent predictions of flow-based T2V models, avoiding gradient backpropagation through internal model layers while maintaining motion fidelity and significantly reducing inference time and memory overhead.
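
The key property is that the guidance is built from quantities the sampler already produces, so no gradients flow through the network. Below is a minimal sketch of a guided flow (ODE) sampling step under that assumption; `v_model` and `v_ref` are stand-ins for the frozen T2V velocity field and a reference motion signal derived from the source video, not FlowMotion's actual API.

```python
def guided_flow_step(z, t, dt, v_model, v_ref, w=0.5):
    """One Euler step of a flow-matching sampler with training-free
    motion guidance: blend the model's predicted velocity with a
    reference motion signal instead of backpropagating a guidance
    loss through the model's layers."""
    v = v_model(z, t)                     # single forward pass, no grad
    v_guided = v + w * (v_ref(z, t) - v)  # steer toward reference motion
    return z + dt * v_guided

# Toy scalar demo: blends velocities 1.0 and 2.0, then takes an Euler step.
print(guided_flow_step(0.0, 0.5, 0.1, lambda z, t: 1.0, lambda z, t: 2.0))
```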

Linking Modality Isolation in Heterogeneous Collaborative Perception

CodeAlign constructs a discrete code space via codebooks and cross-modal Feature-Code-Feature (FCF) translation. It is the first framework to address the "modality isolation" problem in heterogeneous collaborative perception, where different modalities never co-occur in the training data, and it achieves SOTA perception performance with only 8% of HEAL's training parameters and a 1024x reduction in communication.
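
A toy sketch of the Feature-Code-Feature hop, under the assumption that FCF works like standard nearest-neighbor vector quantization into a shared codebook followed by a per-modality decoder (`decoder_b` below is a hypothetical stand-in for such a learned decoder):

```python
import numpy as np

def fcf_translate(feat_a, codebook, decoder_b):
    """Quantize modality A's feature to its nearest shared code, then
    decode that code into modality B's feature space. The discrete code
    is the bridge, so A and B never need to co-occur during training."""
    dists = ((codebook - feat_a) ** 2).sum(axis=-1)  # distance to each code
    idx = int(np.argmin(dists))                      # discrete code index
    return decoder_b(codebook[idx]), idx

codebook = np.random.randn(1024, 16)                 # shared discrete code space
feat_b, idx = fcf_translate(np.random.randn(16), codebook, lambda c: c)
```

Transmitting the integer code index instead of a dense feature map is also a plausible source of the large communication reduction, though the note above doesn't spell out the wire format.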

LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

LottieGPT is the first autoregressive framework for vector animation generation. It designs a Lottie tokenizer that encodes hierarchical geometry, transforms, and keyframe motion into compact token sequences, builds a 660K-animation dataset, and fine-tunes Qwen-VL to generate editable vector animations directly from text or image inputs.
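
To make "tokenizing animation" concrete, here is a toy Python illustration of the general recipe (quantize continuous attributes into a small integer vocabulary so a decoder-only LM can model them autoregressively); the actual Lottie tokenizer's vocabulary, value ranges, and hierarchy are not specified in the note, and the layout below is invented:

```python
def tokenize_keyframe(t, x, y, scale, rot, n_bins=256, vmax=512.0):
    """Toy keyframe tokenizer: map each continuous attribute into one of
    n_bins integer buckets and emit a flat token sequence."""
    q = lambda v, lo, hi: int(round((v - lo) / (hi - lo) * (n_bins - 1)))
    return ["<kf>", f"t:{q(t, 0, 10)}", f"x:{q(x, -vmax, vmax)}",
            f"y:{q(y, -vmax, vmax)}", f"s:{q(scale, 0, 4)}",
            f"r:{q(rot, -180, 180)}"]

print(tokenize_keyframe(1.0, 120.0, -48.5, 1.0, 30.0))
# ['<kf>', 't:26', 'x:157', 'y:115', 's:64', 'r:149']
```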

Model Merging in the Essential Subspace

ESM constructs an "essential subspace" via PCA on the activation shifts induced by parameter updates (rather than SVD on the parameter matrices themselves), and applies three-level polarized scaling to amplify critical parameters while suppressing noise. On 20-task ViT-B/32 merging, it improves over Iso-CTS by 3.2 points of absolute accuracy.
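
A hedged NumPy sketch of the subspace construction for a single linear layer, assuming "activation shift" means the change in the layer's outputs on real inputs when the task update is applied (the three-level polarized scaling step is not sketched, and all names here are illustrative):

```python
import numpy as np

def essential_subspace(X, W_pre, W_ft, k):
    """PCA on activation shifts: measure how the task's parameter update
    tau = W_ft - W_pre moves the layer's activations on n input samples X,
    then keep the top-k principal directions of that shift."""
    tau = W_ft - W_pre                       # task vector for this layer
    shifts = X @ tau.T                       # (n, d_out) activation shifts
    shifts = shifts - shifts.mean(axis=0)    # center before PCA
    _, _, Vt = np.linalg.svd(shifts, full_matrices=False)
    return Vt[:k]                            # (k, d_out) essential subspace basis

X = np.random.randn(256, 64)                 # probe inputs
W_pre = np.random.randn(32, 64)
basis = essential_subspace(X, W_pre, W_pre + 0.01 * np.random.randn(32, 64), k=4)
```

The contrast with SVD on the weight delta itself is that the data distribution decides which directions of the update actually matter.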

MXNorm: Reusing MXFP Block Scales for Efficient Tensor Normalisation

MXNorm fuses RMSNorm with MXFP quantization by reusing the block absmax values already computed during MXFP8 quantization to approximate the RMS value, eliminating the separate normalization reduction operation. It maintains training accuracy on Llama 3 up to 8B parameters while achieving up to 2.4x kernel speedup on GB200.

MXNorm: Reusing MXFP Block Scales for Efficient Tensor Normalisation

GPU matrix multiplication throughput has improved 80x (V100 to GB200) while reduction/elementwise operations improved only 5-9x, making RMSNorm a new bottleneck in low-precision training. MXNorm directly reuses the block scales already computed during MXFP8 quantization to estimate the RMS, shrinking the reduction size by 32x. Theorem 1 proves that the generalized \(p\)-mean of the block absmax values converges to a constant multiple of the RMS. Llama 3 pretraining (125M/1B/8B) validates that MXNorm (\(p=2\)) matches RMSNorm with minimal accuracy difference, and torch.compile benchmarks show up to 2.4x isolated kernel speedup and +1.3%/+2.6% Llama 3 8B layer acceleration for MXFP8/NVFP4. It is a drop-in replacement with zero additional hyperparameters.
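
A minimal NumPy sketch of the core trick, assuming a block size of 32 and a calibration constant obtained once from Gaussian statistics (the paper's exact constant and \(p\)-mean details may differ):

```python
import numpy as np

BLOCK = 32  # MXFP block size; x.size must be a multiple of it here

def block_absmax(x, block=BLOCK):
    """Per-block absolute maxima: the scales MXFP8 quantization
    already computes, reused here instead of a fresh full reduction."""
    return np.abs(x.reshape(-1, block)).max(axis=1)

def mxnorm_rms(x, p=2.0, c=None):
    """Estimate RMS(x) from the generalized p-mean of the block absmaxes.
    Per Theorem 1 (as summarized above), that p-mean converges to a
    constant multiple of the true RMS, so a single calibrated constant c
    converts the reused scales into an RMS estimate."""
    if c is None:
        # One-off calibration on Gaussian noise (assumes roughly
        # Gaussian activations; the paper's constant may be analytic).
        g = np.random.randn(1 << 20)
        c = np.sqrt(np.mean(g**2)) / np.mean(block_absmax(g) ** p) ** (1 / p)
    return c * np.mean(block_absmax(x) ** p) ** (1 / p)

x = np.random.randn(4096)
print(np.sqrt(np.mean(x**2)), mxnorm_rms(x))  # true RMS vs. MXNorm estimate
```

The 32x reduction-size decrease falls out directly: the RMS estimate touches one scale per 32-element block instead of every element.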

Watch and Learn: Learning to Use Computers from Online Videos

Watch & Learn proposes using an inverse dynamics model (IDM) to automatically convert YouTube tutorial videos into executable UI trajectory data (53K+ trajectories without manual annotation), enhancing CUA capabilities with +11.1% improvement for Qwen 2.5VL-7B and +3.8% for UI-TARS-1.5-7B on OSWorld.

Watch and Learn: Learning to Use Computers from Online Videos

This paper proposes Watch & Learn (W&L), a framework that leverages an Inverse Dynamics Model (IDM) to automatically convert human computer-use tutorial videos from the internet into executable UI trajectory data. The system generates 53K+ high-quality trajectories that serve as either ICL demonstrations or SFT training data, significantly improving CUA performance across multiple models and platforms.
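
A sketch of the IDM labeling loop under an assumed interface (the real W&L pipeline, action schema, and any filtering are not specified in the note):

```python
def video_to_trajectory(frames, idm, instruction):
    """Turn a tutorial video into an executable UI trajectory: an inverse
    dynamics model predicts the action (click, type, scroll, ...) that
    explains each transition between consecutive screenshots."""
    steps = []
    for before, after in zip(frames, frames[1:]):
        action = idm(before, after)   # e.g. {"type": "click", "x": 512, "y": 300}
        steps.append({"obs": before, "action": action})
    return {"instruction": instruction, "steps": steps}
```

The resulting trajectories can then serve either as in-context demonstrations or as SFT data, as the note above describes.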