📚 Pretraining

🎞️ ECCV2024 · 8 paper notes

📌 Same area in other venues: 📷 CVPR2026 (5) · 🔬 ICLR2026 (79) · 💬 ACL2026 (12) · 🧪 ICML2026 (27) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (51)

🔥 Top topics: Few-/Zero-Shot Learning ×2

Cross-Domain Learning for Video Anomaly Detection with Limited Supervision: Proposes a weakly-supervised cross-domain learning (CDL) framework that integrates unlabeled external videos into training via an uncertainty-driven pseudo-labeling mechanism, significantly improving the cross-domain generalization capability of video anomaly detection.
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects: DragAPart proposes an image generator that uses dragging as an interactive interface, capable of responding to part-level interactions (such as opening/closing drawers/doors) rather than merely moving the entire object. Through the new synthetic dataset Drag-a-Move, multi-resolution drag encoding, and domain randomization strategies, the model generalizes well to real images and unseen categories despite being trained solely on synthetic data.
I Can't Believe It's Not Scene Flow!: Reveals that current evaluation metrics mask the catastrophic failure of existing scene flow methods on small objects such as pedestrians, and proposes Bucket Normalized EPE, a category-aware and velocity-normalized evaluation protocol. It also introduces TrackFlow, a simple yet SOTA baseline that generates scene flow from a detector plus a tracker and achieves a 1.5x improvement in describing pedestrian motion.
Learning to Obstruct Few-Shot Image Classification over Restricted Classes: Proposes the Learning to Obstruct (LTO) algorithm, which modifies pre-trained backbone parameters via a MAML-like meta-learning approach so that they become a "bad initialization" for specific restricted classes. This hinders the fine-tuning performance of few-shot classification methods on restricted classes while maintaining normal performance on other classes.
Plan, Posture and Go: Towards Open-Vocabulary Text-to-Motion Generation: This paper proposes PRO-Motion, a divide-and-conquer framework that decomposes text-to-motion generation into three stages: LLM-driven motion planning (Plan), script-based posture diffusion generation (Posture), and global translation and rotation estimation (Go). By reducing the complexity of each stage, it achieves high-quality open-vocabulary motion generation.
PreLAR: World Model Pre-training with Learnable Action Representation: This paper proposes PreLAR to bridge the gap between action-free pre-training and action-conditioned fine-tuning for world models. By encoding implicit action representations from adjacent frames and designing an action-state consistency loss during unsupervised pre-training on action-free videos, PreLAR significantly improves the sample efficiency of downstream visual control tasks.
Prompting Language-Informed Distribution for Compositional Zero-Shot Learning: This paper proposes the PLID method, which leverages sentence-level category descriptions generated by LLMs to construct language-knowledge-driven Gaussian distributions. Combined with vision-language primitive decomposition and randomized logit fusion, it achieves state-of-the-art (SOTA) performance on the Compositional Zero-Shot Learning (CZSL) task.
Scaling Backwards: Minimal Synthetic Pre-training?: Proposes 1p-frac, which achieves pre-training performance comparable to ImageNet-1k pre-training using only minute perturbations of a single fractal image. This challenges the conventional wisdom that "pre-training requires large-scale datasets" and suggests that the essence of pre-training might be closer to weight initialization than to visual concept learning.
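
To make the "category-aware and velocity-normalized" idea from the scene flow note above more concrete, here is a minimal sketch. It is not the paper's exact Bucket Normalized EPE protocol: it skips speed bucketing entirely and simply divides each class's mean endpoint error by that class's mean ground-truth speed. The function name `class_normalized_epe`, its parameters, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np


def class_normalized_epe(pred_flow, gt_flow, class_ids, eps=1e-6):
    """Hypothetical helper (not from the paper): per-class mean endpoint
    error divided by that class's mean ground-truth speed."""
    pred_flow = np.asarray(pred_flow, dtype=float)
    gt_flow = np.asarray(gt_flow, dtype=float)
    class_ids = np.asarray(class_ids)

    # Endpoint error and ground-truth speed for every point.
    epe = np.linalg.norm(pred_flow - gt_flow, axis=1)
    speed = np.linalg.norm(gt_flow, axis=1)

    result = {}
    for c in np.unique(class_ids):
        mask = class_ids == c
        # Normalizing by speed puts slow and fast classes on the same scale.
        result[str(c)] = float(epe[mask].mean() / max(speed[mask].mean(), eps))
    return result


def plain_mean_epe(pred_flow, gt_flow):
    """Standard average endpoint error over all points, for comparison."""
    diff = np.asarray(pred_flow, dtype=float) - np.asarray(gt_flow, dtype=float)
    return float(np.linalg.norm(diff, axis=1).mean())


# Toy data (made up): two fast car points, one slow pedestrian point.
GT = np.array([[10.0, 0.0, 0.0], [10.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
# The prediction is close for the cars but predicts zero motion for the pedestrian.
PRED = np.array([[9.5, 0.0, 0.0], [9.5, 0.0, 0.0], [0.0, 0.0, 0.0]])
CLASSES = np.array(["car", "car", "pedestrian"])

if __name__ == "__main__":
    print("plain mean EPE:", plain_mean_epe(PRED, GT))
    print("class-normalized EPE:", class_normalized_epe(PRED, GT, CLASSES))
```

In this toy setup the plain average looks moderate because the car points dominate it, while the normalized pedestrian score shows that the pedestrian's motion was missed entirely. That is the kind of masking effect the note describes.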