📚 Pretraining

📹 ICCV 2025 · 10 paper notes

ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training

ACE-G decomposes a scene coordinate regressor (SCR) into a general-purpose Transformer and a scene-specific map code, and pre-trains the Transformer across tens of thousands of scenes so it learns to generalize from mapping images to unseen query images. This markedly improves relocalization robustness under illumination and viewpoint changes while remaining computationally efficient.

ConstStyle: Robust Domain Generalization with Unified Style Transformation

This paper proposes ConstStyle, a framework that constructs a theoretically grounded Unified Domain: all training samples are style-aligned to it during training, and test samples from unseen domains are partially projected toward it at inference time, reducing the domain gap and improving generalization.
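The core operation (full style alignment for training samples, partial projection for test samples) can be sketched with AdaIN-style statistics matching. This is a toy illustration, not ConstStyle's actual implementation; the function name, the affine blend `alpha`, and the per-channel statistics are assumptions for clarity:

```python
import numpy as np

def align_to_unified_style(feat, unified_mean, unified_std, alpha=1.0, eps=1e-5):
    """Shift a feature map's per-channel statistics toward a unified domain.

    Toy AdaIN-style renormalization: alpha=1.0 fully projects onto the
    unified statistics (as for training samples), alpha<1.0 only partially
    projects (as for test samples from unseen domains).
    `feat` has shape (C, H, W); unified stats have shape (C, 1, 1).
    """
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True) + eps
    normalized = (feat - mean) / std          # whiten per channel
    target_mean = alpha * unified_mean + (1 - alpha) * mean
    target_std = alpha * unified_std + (1 - alpha) * std
    return normalized * target_std + target_mean
```

With `alpha=1.0` the output's channel statistics match the unified domain exactly; intermediate `alpha` interpolates between the sample's own style and the unified one.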

Dataset Ownership Verification for Pre-trained Masked Models

DOV4MM proposes the first dataset ownership verification method tailored for masked pre-trained models. By comparing the embedding reconstruction difficulty of seen versus unseen samples, and applying a paired t-test, the method determines whether a black-box model was pre-trained on a specific dataset. It achieves p-values well below 0.05 across 10 masked image models and 4 masked language models.

ETA: Energy-based Test-time Adaptation for Depth Completion

This paper proposes ETA, a method that employs an energy-based model to quantify the likelihood of depth predictions belonging to the source domain distribution, and guides a pre-trained depth completion model to adapt to new environments at test time by minimizing the energy of target-domain predictions. ETA achieves average improvements of 6.94% and 10.23% over the previous state of the art on outdoor and indoor scenes, respectively.

FlowMo: Flow to the Mode — Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

This paper proposes FlowMo, a Transformer-based diffusion autoencoder trained in two stages (mode-matching pretraining + mode-seeking post-training), achieving state-of-the-art performance on ImageNet-1K discrete image tokenization for the first time among diffusion autoencoders — without convolutions, adversarial losses, 2D spatially-aligned latents, or distillation from other tokenizers.

Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution

This paper introduces Image Intrinsic Scale (IIS)—the maximum scaling factor at which an image exhibits its highest perceptual quality—and proposes the IISA task, constructs a dataset of 785 images with expert annotations, and presents a weak-label training strategy (WIISA) that consistently improves IIS prediction across multiple NR-IQA methods.
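The IIS definition (the maximum scaling factor at which perceptual quality peaks) can be made concrete with a small search over candidate scales. This is a toy reading of the definition, not the paper's method; `quality_fn` stands in for a perceptual quality predictor, and the tolerance `tol` is an assumption:

```python
def intrinsic_scale(quality_fn, scales, tol=1e-3):
    """Return the largest candidate scale whose quality is within `tol`
    of the best quality observed (toy reading of the IIS definition)."""
    qualities = {s: quality_fn(s) for s in scales}
    best = max(qualities.values())
    return max(s for s, q in qualities.items() if q >= best - tol)
```

If quality plateaus over a range of downscales, the IIS is the largest scale on that plateau; shrinking below it loses detail, while keeping the image larger only preserves degradations.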

Make Your Training Flexible: Towards Deployment-Efficient Video Models

This paper proposes Flux — a data augmentation tool that enables flexible video model training through flexible sampling grids and group-dynamic token selection, allowing a single model to operate efficiently across varying computational budgets. The paper further introduces a Token Optimization test-time paradigm that matches previous SOTA performance using only 1/4 of the tokens, saving approximately 90% of computation.
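The Token Optimization idea (spend a reduced test-time budget on the most informative tokens) can be sketched as a simple top-k selection. This is a hypothetical illustration, not Flux's selection rule; the scoring signal (e.g. attention or feature norms) and the function name are assumptions:

```python
import numpy as np

def select_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring fraction of tokens, preserving their
    original order. `tokens` has shape (N, D); `scores` has shape (N,).
    keep_ratio=0.25 mirrors the paper's 1/4-token budget."""
    k = max(1, int(len(tokens) * keep_ratio))
    idx = np.argsort(scores)[-k:]       # indices of the top-k scores
    return tokens[np.sort(idx)]         # restore temporal/spatial order
```

Because attention cost scales quadratically with token count, keeping 1/4 of the tokens is what makes the reported ~90% compute saving plausible.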

Synchronization of Multiple Videos

This paper proposes Temporal Prototype Learning (TPL), a prototype-based video synchronization framework that constructs shared compact 1D representations from high-dimensional embeddings extracted by pretrained models. By learning a unified prototype sequence to anchor key action phases, TPL aligns multiple videos jointly and, for the first time, addresses the synchronization of generative AI videos.
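Once each video is reduced to a compact 1D representation, aligning it to the shared prototype sequence is a monotonic matching problem; classic dynamic time warping (DTW) is a simplified stand-in for that step. This sketch is not TPL's algorithm, and the scalar embeddings are an assumption for readability:

```python
import numpy as np

def dtw_align(seq, prototype):
    """Monotonically align a video's 1D embedding sequence to a shared
    prototype sequence; returns (frame_idx, prototype_phase) pairs."""
    n, m = len(seq), len(prototype)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq[i - 1] - prototype[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    # backtrack to recover the frame -> prototype-phase assignment
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Aligning every video against the same prototype sequence synchronizes them jointly: frames mapped to the same prototype phase correspond to the same action stage.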

SynCity: Training-Free Generation of 3D Worlds

SynCity proposes a training- and optimization-free method for 3D world generation. Through carefully designed prompt engineering strategies, it combines a pretrained language model, a 2D image generator (Flux), and a 3D generator (TRELLIS) to autoregressively synthesize large-scale, high-quality, freely navigable 3D scenes in a tile-by-tile fashion.