📚 Pretraining
📹 ICCV2025 · 9 paper notes
- ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
ACE-G decomposes a scene coordinate regressor into a scene-agnostic Transformer and a scene-specific map code. Alternating mapping/query pre-training across tens of thousands of scenes yields significant generalization gains under illumination and viewpoint variation while keeping computational overhead light.
- ConstStyle: Robust Domain Generalization with Unified Style Transformation
This paper proposes ConstStyle, a framework built around a theoretically grounded Unified Domain. During training, all samples are style-aligned to this domain; at inference time, test samples from unseen domains are partially projected toward it, which reduces the domain gap and improves generalization performance.
- Dataset Ownership Verification for Pre-trained Masked Models
DOV4MM proposes the first dataset ownership verification method tailored for masked pre-trained models. By comparing the embedding reconstruction difficulty of seen versus unseen samples and applying a paired t-test, the method determines whether a black-box model was pre-trained on a specific dataset. It achieves p-values well below 0.05 across 10 masked image models and 4 masked language models.
- ETA: Energy-based Test-time Adaptation for Depth Completion
This paper proposes ETA, a method that employs an energy-based model to quantify the likelihood of depth predictions belonging to the source domain distribution, and guides a pre-trained depth completion model to adapt to new environments at test time by minimizing the energy of target-domain predictions. ETA achieves average improvements of 6.94% and 10.23% over the previous state of the art on outdoor and indoor scenes, respectively.
- FlowMo: Flow to the Mode — Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
This paper proposes FlowMo, a Transformer-based diffusion autoencoder trained in two stages: mode-matching pretraining followed by mode-seeking post-training. It is the first diffusion autoencoder to achieve state-of-the-art performance on ImageNet-1K discrete image tokenization, and it does so without convolutions, adversarial losses, 2D spatially-aligned latents, or distillation from other tokenizers.
- Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution
This paper introduces Image Intrinsic Scale (IIS), the maximum scaling factor at which an image exhibits its highest perceptual quality. It proposes the IISA task, constructs a dataset of 785 images with expert annotations, and presents a weak-label training strategy (WIISA) that consistently improves IIS prediction across multiple NR-IQA methods.
- Make Your Training Flexible: Towards Deployment-Efficient Video Models
This paper proposes Flux, a data augmentation tool that enables flexible video model training through flexible sampling grids and group-dynamic token selection, so that a single model can operate efficiently across varying computational budgets. The paper further introduces a Token Optimization test-time paradigm that matches previous SOTA performance using only 1/4 of the tokens, saving approximately 90% of computation.
- Synchronization of Multiple Videos
This paper proposes Temporal Prototype Learning (TPL), a prototype-based video synchronization framework that constructs shared compact 1D representations from high-dimensional embeddings extracted by pretrained models. By learning a unified prototype sequence to anchor key action phases, TPL aligns multiple videos jointly and, for the first time, addresses the synchronization of generative AI videos.
- SynCity: Training-Free Generation of 3D Worlds
SynCity proposes a training- and optimization-free method for 3D world generation. Through carefully designed prompt engineering strategies, it combines a pretrained language model, a 2D image generator (Flux), and a 3D generator (TRELLIS) to autoregressively synthesize large-scale, high-quality, freely navigable 3D scenes in a tile-by-tile fashion.
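
The paired t-test mentioned in the DOV4MM note above can be illustrated with a small sketch. This is not the paper's implementation: the scores are synthetic, the function names `paired_t_statistic` and `one_sided_p_value` are hypothetical, pairing samples by index is an assumption, and the p-value uses a normal approximation to the t distribution, which is only reasonable for a large number of pairs. It also assumes that a lower score means a sample is easier to reconstruct, so a model that saw the data should score lower on seen samples.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def paired_t_statistic(seen, unseen):
    """Return (t, degrees_of_freedom) for the paired differences unseen[i] - seen[i]."""
    if len(seen) != len(unseen) or len(seen) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [u - s for s, u in zip(seen, unseen)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


def one_sided_p_value(t):
    """Approximate P(T >= t) with a standard normal (assumption: large n)."""
    return 1.0 - NormalDist().cdf(t)


# Synthetic "reconstruction difficulty" scores; purely illustrative values.
rng = random.Random(0)
seen_scores = [rng.gauss(1.0, 0.2) for _ in range(200)]
unseen_scores = [rng.gauss(1.2, 0.2) for _ in range(200)]

t_stat, dof = paired_t_statistic(seen_scores, unseen_scores)
p_value = one_sided_p_value(t_stat)

# A small p-value suggests the seen samples are systematically easier to reconstruct.
suspect_trained_on_data = p_value < 0.05
```

With more than a few dozen pairs the normal approximation is usually close to the exact t distribution; for small samples, a statistics library's paired t-test would be the safer choice.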