
📚 Pretraining

📷 CVPR2025 · 15 paper notes

📌 Same area in other venues: 📷 CVPR2026 (5) · 🔬 ICLR2026 (79) · 💬 ACL2026 (12) · 🧪 ICML2026 (27) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (51)

A Unified Framework for Heterogeneous Semi-supervised Learning: This paper proposes a new problem setting termed Heterogeneous Semi-Supervised Learning (HSSL), where labeled and unlabeled data originate from domains with different distributions, and the goal is to train a model that generalizes well to both domains. By expanding the C-class problem into a 2C-class classification task (where the same semantic class in different domains is treated as distinct classes), this work provides a unified solution integrating Weighted Moving Average (WMA) pseudo-labeling, cross-domain prototype alignment, and progressive cross-domain Mixup.
AMO Sampler: Enhancing Text Rendering with Overshooting: This paper proposes the Attention-Modulated Overshooting (AMO) sampler, a training-free method applied at inference time. It adds an overshooting step with noise compensation, acting as a Langevin-dynamics correction, to the sampling process of rectified flow models, and adaptively controls the overshooting intensity using text-image cross-attention scores. This significantly improves text rendering accuracy while maintaining the overall quality of generated images.
Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior: This work introduces the concepts of "System GAP" and "Random GAP" for the first time to describe the information mismatch between brain signals and visual stimuli. By dynamically adjusting the image blur level through an Uncertainty-Aware Blur Prior (UBP) to alleviate overfitting during training, it achieves a 50.9% top-1 accuracy on the 200-way zero-shot brain-image retrieval task, outperforming the previous SOTA by 13.7 percentage points.
ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval: This paper proposes the ConText-CIR framework, which uses a Text Concept-Consistency loss to align noun phrases in the text modification with the corresponding regions of the query image. Combined with a synthetic data generation pipeline, it achieves SOTA performance on multiple CIR benchmarks.
DreamText: High Fidelity Scene Text Synthesis: DreamText reconstructs the training pipeline of diffusion models, introducing character-level balanced supervision and a heuristic alternate optimization strategy to calibrate character attention. It also trains the text encoder jointly with the generator to learn diverse font styles. It significantly outperforms state-of-the-art methods on scene text synthesis, raising SeqAcc from UDiffText's 0.763 to 0.940.
Exploration-Driven Generative Interactive Environments: This work provides an open-source implementation of the Genie world model (GenieRedux), which is enhanced to GenieRedux-G by incorporating ground-truth action conditioning, Token Distance Cross-Entropy (TDCE) loss, and token skip connections. Additionally, the AutoExplore agent is proposed to utilize the world model's token prediction uncertainty as an intrinsic reward to drive diverse data collection, improving simulation quality by up to 7.4 PSNR.
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction: This paper proposes IAR, which rearranges the VQGAN codebook via balanced K-means to align similar embeddings with adjacent indices. Combined with a cluster-oriented cross-entropy loss that guides the model to correctly predict the semantic cluster of the target token, IAR halves the training time while improving generation quality across all LlamaGen scales from 100M to 1.4B.
Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics: Using the NTK framework, this paper shows that linearized attention mechanisms do not converge to the infinite-width NTK limit: a spectral amplification effect cubes the condition number of the Gram matrix, so convergence would require a width of \(m = \Omega(\kappa^6)\). It introduces the concept of "influence malleability" to quantify the dual consequences of this non-convergence. An attention network's malleability is 6-9 times higher than that of a ReLU network, which both enhances task adaptability and exacerbates adversarial vulnerability.
MR-PLIP: Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation: This paper proposes MR-PLIP, the first multi-resolution pathology vision-language pre-training model. Pre-trained on 34 million multi-resolution image-text pairs from the TCGA dataset, it outperforms SOTA on 26 datasets through cross-resolution vision-text alignment and text-guided visual representation.
PlanarSplatting: Accurate Planar Surface Reconstruction in 3 Minutes: This paper proposes PlanarSplatting, which directly optimizes learnable 3D rectangular plane primitives. By utilizing a newly designed rectangular splatting function, planes are differentiably rendered into depth and normal maps. This enables the reconstruction of accurate indoor planar scenes from multi-view images in just 3 minutes without requiring any plane annotations.
Precise Event Spotting in Sports Videos: Solving Long-Range Dependency and Class Imbalance: This paper proposes an end-to-end trainable framework for precise event spotting. An Adaptive Spatiotemporal Refinement Module (ASTRM) enriches the spatiotemporal information in the features, and a Soft Instance Contrastive (SoftIC) loss addresses class imbalance. The framework surpasses the state of the art with 73.74 mAP under the SoccerNet V2 tight setting.
Robust Message Embedding via Attention Flow-Based Steganography: This paper proposes the RMSteg (Robust Message Steganography) framework, which integrates the Transformer attention mechanism into normalizing flow networks (AttnFlow) for the first time. Combined with an invertible QR code transition and an invertible token fusion module, it achieves high-quality, high-capacity, and robust message-to-image steganography. The stego-images can be decoded accurately even after extreme distortions such as print-and-capture.
ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model: This paper systematically validates the scaling law in the field of human motion generation for the first time. It proposes ScaMo, a scalable system comprising Motion FSQ-VAE (addressing codebook collapse), the 260-hour MotionUnion dataset, and a text-prefix autoregressive Transformer. The study discovers a logarithmic relationship between normalized test loss and FLOPs, as well as power-law relationships between vocabulary parameters, model parameters, data size, and FLOPs. Moreover, the optimal configuration is successfully predicted under a budget of \(1\times 10^{18}\) FLOPs.
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection: This paper proposes CLIP-PGS (Patch Generation-to-Selection), a simple yet effective masking strategy. It follows a progressive "generation-to-selection" process: pre-selecting candidate masked patches, preserving critical semantic regions with Sobel edge detection, and then refining the selection using optimal transport normalization. This improves CLIP training efficiency (reducing training time to \(0.5\text{--}0.6\times\)) while achieving SOTA performance on zero-shot classification, retrieval, and other tasks.
The Scene Language: Representing Scenes with Programs, Words, and Embeddings: This paper introduces Scene Language, a new paradigm that represents visual scenes with a triplet \(\Phi(s)=(W,P,Z)\) of words (\(W\), semantic categories), programs (\(P\), encoding hierarchical structure), and embeddings (\(Z\), visual identity). It generates scene representations from text or image inputs via training-free inference using Claude 3.5 Sonnet, supports traditional, neural, and hybrid rendering, and outperforms existing representations such as scene graphs in 3D/4D scene generation quality and controllable editing.
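
As a rough illustration of the 2C-class expansion mentioned in the Heterogeneous Semi-Supervised Learning entry above, the sketch below maps a (class, domain) pair to one of 2C labels, so the same semantic class in the two domains gets two distinct indices. The index layout (domain 0 occupies indices 0..C-1, domain 1 occupies C..2C-1), the function names `expand_label` and `collapse_probs`, and the rule of summing the two domain copies to recover C-way scores are all assumptions made for this example; the summary does not say how the paper implements any of them.

```python
from typing import List, Sequence


def expand_label(class_id: int, domain_id: int, num_classes: int) -> int:
    """Map (class, domain) to an index in the 2C-way label space.

    Assumed layout: domain 0 -> [0, C), domain 1 -> [C, 2C).
    """
    if not 0 <= class_id < num_classes:
        raise ValueError("class_id must be in [0, num_classes)")
    if domain_id not in (0, 1):
        raise ValueError("domain_id must be 0 or 1")
    return domain_id * num_classes + class_id


def collapse_probs(probs_2c: Sequence[float], num_classes: int) -> List[float]:
    """Sum the two domain-specific copies of each class into C-way scores.

    This collapse rule is an illustrative assumption, not taken from the paper.
    """
    if len(probs_2c) != 2 * num_classes:
        raise ValueError("expected exactly 2 * num_classes probabilities")
    return [probs_2c[c] + probs_2c[num_classes + c] for c in range(num_classes)]


if __name__ == "__main__":
    C = 3
    # The same semantic class (1) gets a different label in each domain.
    print(expand_label(1, 0, C))  # label for class 1 in domain 0
    print(expand_label(1, 1, C))  # label for class 1 in domain 1
    # A 2C-way prediction folded back to C semantic classes.
    print(collapse_probs([0.05, 0.10, 0.05, 0.10, 0.60, 0.10], C))
```

With this layout a single 2C-way classifier can be trained on both domains at once, and a domain-agnostic prediction is obtained by taking the argmax of the collapsed scores.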