Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

Conference: ICML 2026
arXiv: 2603.07615
Code: Yes (Official)
Area: Image Generation/Visual Compression
Keywords: Implicit representation, Diffusion models, Visual compression, LoRA, Inference-time scaling

TL;DR

Visual signals are encoded as Low-Rank Adaptation (LoRA) parameters on a frozen diffusion foundation model and compressed into a single compact vector via hash mapping, achieving high-perceptual-quality video compression at extremely low bitrates while supporting inference-time scaling and generative editing.

Background & Motivation

Background: Large-scale visual generative models (e.g., Wan-2.1, Qwen) have acquired rich visual knowledge from massive training data. Visual signals themselves, however, are still stored as external explicit representations such as pixels, latent variables, or tokens, which make no direct use of the prior knowledge inside the model. Traditional video codecs (H.265/H.266) produce explicit bitstreams, and neural codecs encode signals into explicit latent codes via VAE-style autoencoders. In both cases the signal-specific information is stored entirely in the code, while the decoder is shared across signals and carries no signal-specific information.

Limitations of Prior Work: Although Implicit Neural Representations (INR) can parameterize signals as small MLPs, these networks are trained from scratch and are completely decoupled from the high-level visual knowledge of large-scale pretrained models, limiting their compression capability. Even recent works combining INR with diffusion processes fail to truly leverage the semantic priors encoded in foundation models.

Key Challenge: Explicit representation separates "what the signal is" from "what the model knows," leading to representation redundancy—the model already "knows" what natural images/videos look like, but cannot utilize this knowledge during compression.

Goal: Instead of compressing "what the visual signal is," this work compresses "how to generate the visual signal"—representing the visual signal as a generation function of the diffusion model, using minimal parameter deviation to describe the adaptation process from the pretrained model to the target signal.

Core Idea: Use LoRA for single-example fine-tuning on a frozen diffusion model, map the adaptation parameters to a single vector \(\mathbf{v} \in \mathbb{R}^{1 \times k}\) via a pseudo-random hash mapping, and apply entropy-constrained quantization to compress an 81-frame video into a single compact vector.

Method

Overall Architecture

Given a visual signal \(x\) (e.g., an 81-frame 480p video), a detailed caption \(c\) is first generated as a condition using a VLM (e.g., GPT-5.1). Then, on a frozen video diffusion model, the LoRA parameters are overfitted to the single example using a flow-matching objective \(\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\epsilon}[\|v_\theta(x_t, t) - (\epsilon - x)\|^2]\). The optimized LoRA parameters are compressed into a single vector \(\mathbf{v}\) through a PRNG-driven hash mapping, followed by quantization and entropy coding to produce the final bitstream. At the decoding end, the same foundation model and the decoded \(\mathbf{v}\) are used to recover the LoRA weights, reconstructing the video via ODE/SDE sampling.
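
The flow-matching objective above can be illustrated with a minimal NumPy sketch. It is not the authors' training code: the linear path \(x_t = (1-t)x + t\epsilon\) is an assumption chosen because it is consistent with the regression target \(\epsilon - x\), and `velocity_fn`, `flow_matching_loss`, and the toy signal are hypothetical names standing in for the LoRA-adapted diffusion model and the video.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x, rng, num_samples=64):
    """Monte Carlo estimate of E_{t,eps} ||v(x_t, t) - (eps - x)||^2 for one signal x."""
    total = 0.0
    for _ in range(num_samples):
        t = rng.uniform(1e-3, 1.0)            # avoid t = 0 in the toy oracle below
        eps = rng.standard_normal(x.shape)    # Gaussian noise sample
        x_t = (1.0 - t) * x + t * eps         # ASSUMED linear interpolation path
        target = eps - x                      # regression target from the objective
        total += np.sum((velocity_fn(x_t, t) - target) ** 2)
    return total / num_samples

# Toy "signal" standing in for a video (hypothetical values).
x = np.array([0.5, -1.0, 2.0])

# A velocity field that has perfectly overfitted x: since x is known,
# eps - x can be recovered from (x_t, t) as (x_t - x) / t.
def oracle_velocity(x_t, t):
    return (x_t - x) / t

# An unadapted stand-in that predicts zero velocity everywhere.
def zero_velocity(x_t, t):
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
print("overfitted loss:", flow_matching_loss(oracle_velocity, x, rng))
print("unadapted loss: ", flow_matching_loss(zero_velocity, x, rng))
```

In the toy setting the overfitted field drives the loss to numerical zero, which mirrors what single-example LoRA overfitting aims for on the real model.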

Key Designs

  1. One-Vector Adaptation:

    • Function: Compresses LoRA parameters from all layers into a single shared vector, significantly reducing the parameter count.
    • Mechanism: For each pretrained weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{m \times n}\), LoRA introduces a low-rank update \(\Delta\mathbf{W} = \mathbf{AB}\) with \(\mathbf{A} \in \mathbb{R}^{m \times r}\), \(\mathbf{B} \in \mathbb{R}^{r \times n}\), and rank \(r \ll \min(m,n)\). Because large models have many layers, the total number of LoRA parameters remains substantial. Borrowing the hashing trick (Chen et al., 2015), a fixed random projection generated by a PRNG maps all layers' LoRA parameters to a single shared vector \(\mathbf{v} \in \mathbb{R}^{1 \times k}\), forcing cross-layer parameter sharing. Learnable scaling parameters \(s\) normalize the vector before uniform quantization (replaced by additive uniform noise during training). A factorized entropy model estimates the bitrate, which is constrained to 1-3 bits per parameter.
    • Design Motivation: To achieve ultra-low bitrate compression—an 81-frame video is represented by only one vector, with the overhead of the caption/entropy model parameters accounting for less than 1% of the total bitrate.
  2. Inference-Time Scaling:

    • Function: Improves reconstruction quality by spending additional computation at encoding and decoding time, without re-optimizing the compressed vector \(\mathbf{v}\).
    • Mechanism: The encoder uses an SDE formulation for denoising, generating \(M\) candidate particles via a shared PRNG at each step. Since the encoder has access to the original signal \(x\), it can calculate the optimal denoising kernel \(p^*(x_{t_{n-1}}|x_{t_n})\) and perform importance sampling on the model’s predicted kernel \(p(x_{t_{n-1}}|x_{t_n})\), selecting the particle with the largest weight \(w^{(m)} \propto p^*(x_{t_{n-1}}^{(m)})/p(x_{t_{n-1}}^{(m)})\). Only the selection index per step (minimal side information) needs to be transmitted, and the decoder reproduces the choice using the same PRNG. Scaling occurs along two axes: candidates per step (affecting only encoding) and denoising steps (affecting both).
    • Design Motivation: A unique advantage of functional representation—the representation itself is part of the generative process, which can still be controlled and optimized after encoding, a capability traditional explicit codecs lack. Scaling is equivalent to relative entropy coding (Diff-C), using the adapted diffusion model as a stronger prior to reduce coding complexity.
  3. Minimum Description Length (MDL) Interpretation:

    • Function: Justifies the training objective as naturally seeking the simplest generation function from an information-theoretic perspective.
    • Mechanism: The pretrained model defines a path measure \(\mathbb{P}\) on the SDE trajectory space, and the adapted model defines \(\mathbb{P}'\). The optimal compression objective is \(\min_{\mathbb{P}'} D_{\text{KL}}[\mathbb{P}' \| \mathbb{P}]\) subject to the terminal-state constraint \(x_0 = x\). The optimal solution is Doob's \(h\)-transform of \(\mathbb{P}\) conditioned on the terminal state. When the pretrained model is perfect, minimizing the flow-matching objective exactly recovers this solution.
    • Design Motivation: Provides theoretical support for "compression as adaptation"—implicit representation only needs to encode the minimum deviation from the pretrained model, naturally leveraging the model prior.
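
The one-vector adaptation in design 1 can be sketched as follows. This is a hedged illustration in the spirit of the hashing trick, not the paper's implementation: the index-plus-sign construction, the function names `hashed_lora_factors` and `quantize`, and all sizes are assumptions made for the example.

```python
import numpy as np

def hashed_lora_factors(v, shapes, seed=0):
    """Rebuild every LoRA factor from one shared vector v.

    ASSUMPTION: each entry of each factor reads one PRNG-chosen slot of v
    with a PRNG-chosen sign. Encoder and decoder share the seed, so the
    mapping itself never has to be transmitted.
    """
    rng = np.random.default_rng(seed)
    k = v.shape[0]
    factors = []
    for shape in shapes:
        idx = rng.integers(0, k, size=shape)         # slot of v used by each entry
        sign = rng.choice([-1.0, 1.0], size=shape)   # random sign per entry
        factors.append(sign * v[idx])
    return factors

def quantize(v, scale, rng=None):
    """Uniform quantization of v / scale.

    With rng given (training), rounding is replaced by additive uniform
    noise in [-0.5, 0.5); without it (coding), values are rounded.
    """
    z = v / scale
    z = z + rng.uniform(-0.5, 0.5, size=z.shape) if rng is not None else np.round(z)
    return z * scale

# Hypothetical sizes: one 8x6 weight matrix, rank 2, shared vector of length 16.
v = np.linspace(-1.0, 1.0, 16)
v_hat = quantize(v, scale=0.25)                       # what the decoder would receive
A, B = hashed_lora_factors(v_hat, [(8, 2), (2, 6)], seed=7)
delta_W = A @ B                                       # low-rank update for this layer
print(delta_W.shape)
```

Because the factors of every layer index into the same `v`, the number of stored values is `k` regardless of how many layers are adapted, which is the cross-layer sharing the text describes.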

Key Experimental Results

Main Results: UVG Perceptual Video Compression

| Method | Bitrate (bpp) | DISTS ↓ | FVD ↓ | PSNR (dB) ↑ |
| --- | --- | --- | --- | --- |
| H.265/HM | ~0.015 | Higher | Higher | ~30 |
| H.266/VTM | ~0.015 | Medium | Medium | ~32 |
| DCVC-RT (MSE) | ~0.012 | Medium | Medium | ~31 |
| GLC-Video (Perceptual) | ~0.012 | Medium | Medium | ~28 |
| VOV (Ours) | ~0.011 | Best | Best | ~24 |
| VOV + Scaling | ~0.011 | Better | Better | ~26 |

VOV significantly outperforms all baselines on the perceptual metrics (DISTS and FVD), especially at ultra-low bitrates, where its visual quality far exceeds that of traditional codecs. Its better (lower) DISTS/FVD scores alongside lower PSNR reflect that generative reconstruction prioritizes perceptual quality over exact pixel alignment.
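
To give a feel for what a bitrate near 0.011 bpp means in absolute terms, here is a back-of-the-envelope calculation. It rests on assumptions not stated in this summary: bpp is taken as bits per pixel averaged over all frames, and "480p" is taken as 832×480, so the resulting numbers are illustrative only.

```python
def payload_bits(bpp, width, height, frames):
    """Total bits for a clip when bpp is averaged over every pixel of every frame."""
    return bpp * width * height * frames

# ASSUMED clip geometry: 81 frames at 832x480; bpp taken from the table (approximate).
bits = payload_bits(0.011, 832, 480, 81)
kilobytes = bits / 8 / 1000

# The entropy constraint quoted in the method is 1-3 bits per parameter,
# which bounds the implied length of the shared vector.
longest_vector = bits / 1    # if every parameter costs 1 bit
shortest_vector = bits / 3   # if every parameter costs 3 bits

print(f"{bits:.0f} bits, about {kilobytes:.1f} kB")
print(f"implied vector length between {shortest_vector:.0f} and {longest_vector:.0f}")
```

Under these assumptions the whole clip fits in a few tens of kilobytes, and the shared vector would have on the order of a hundred thousand entries or more.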

Ablation Study: Inference-Time Scaling Strategy

| Scaling Config | Denoising Steps | Candidates per Step | DISTS ↓ | Effect |
| --- | --- | --- | --- | --- |
| No Scaling (ODE) | 50 | 1 | Baseline | No improvement |
| Step Increase Only | 100 | 1 | ≈Baseline | Almost ineffective |
| Multi-candidate + Few Steps | 100 | \(2^{18}\) | Significant gain | Increased encoder compute only |
| Multi-candidate + Multi-step | 1000 | \(2^{10}\) | Significant gain | Increased compute for both |
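
The per-step candidate selection that these configurations scale can be sketched with toy Gaussian kernels. This is an assumption-laden illustration rather than the paper's algorithm: both kernels are taken to be isotropic Gaussians with the same standard deviation, and `draw_candidates`, `encoder_select`, and `decoder_step` are hypothetical names.

```python
import numpy as np

def draw_candidates(mu_model, sigma, num_candidates, seed):
    """Sample M candidates from the model kernel N(mu_model, sigma^2 I) with a shared seed."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((num_candidates,) + mu_model.shape)
    return mu_model + sigma * noise

def encoder_select(mu_model, mu_star, sigma, num_candidates, seed):
    """Encoder side: pick the candidate with the largest weight p*(c) / p(c).

    mu_star is the mean of the optimal kernel, which only the encoder can
    compute because it has access to the original signal.
    """
    cands = draw_candidates(mu_model, sigma, num_candidates, seed)
    flat = cands.reshape(num_candidates, -1)
    log_p_star = -np.sum((flat - mu_star.ravel()) ** 2, axis=1) / (2 * sigma**2)
    log_p = -np.sum((flat - mu_model.ravel()) ** 2, axis=1) / (2 * sigma**2)
    return int(np.argmax(log_p_star - log_p))   # only this index is transmitted

def decoder_step(mu_model, sigma, num_candidates, seed, index):
    """Decoder side: regenerate the same candidates and take the signalled one."""
    return draw_candidates(mu_model, sigma, num_candidates, seed)[index]

mu_model = np.zeros(4)                       # toy model prediction
mu_star = np.array([1.0, 0.0, 0.0, 0.0])     # toy optimal-kernel mean
index = encoder_select(mu_model, mu_star, sigma=0.5, num_candidates=1024, seed=3)
x_next = decoder_step(mu_model, sigma=0.5, num_candidates=1024, seed=3, index=index)
print(index, x_next)
```

With a fixed-length code, an index among \(M\) candidates costs \(\log_2 M\) bits, so under that assumption \(2^{18}\) candidates over 100 steps would add 1800 bits of side information. Raising the candidate count only lengthens the encoder's search, while the decoder still evaluates the model once per step and regenerates the shared noise.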

Key Findings

  • Counter-intuitive interaction between \(k\) and LoRA rank: With a fixed vector size, increasing the LoRA rank leads to decreased reconstruction quality—higher rank adaptation introduces more densely entangled parameter updates that the fixed-size hashing scheme struggles to preserve.
  • Interchangeable scaling paths: The gain from increasing candidates per step from \(2^{10}\) to \(2^{18}\) is comparable to doubling the denoising steps, though the latter requires more network evaluations.
  • Pure scaling (no adaptation) can also compress: Using the original pretrained model with inference-time scaling alone can achieve strong compression, but codec costs are much higher; LoRA adaptation makes decoding lightweight.
  • Unified compression and generation: The adapted model allows for personalized editing (e.g., changing colors, merging images, changing resolution) by modifying text prompts, though it may introduce biases from the training data.

Highlights & Insights

  • Paradigm shift of "Compression as Adaptation": Redefines compression as finding the minimum deviation adaptation on a pretrained model, naturally utilizing foundation model priors. This concept is transferable to any modality with strong pretrained models (audio, 3D, etc.).
  • Controllability of functional representation: Unlike fixed bitstreams, implicit representation allows for tuning output quality after encoding via inference-time scaling or early stopping—enabling "encode once, decode at multiple qualities."
  • Extreme compression via hash mapping: Mapping thousands of LoRA parameters to a single vector via fixed random projections is conceptually simple yet surprisingly effective—transforming an 81-frame video into a single vector.

Limitations & Future Work

  • Dependence on foundation model capability: Semantic mismatches occasionally occur during reconstruction (especially text in videos); model capacity directly determines the compression upper bound.
  • Slow encoding speed: Single-example overfitting plus inference-time scaling results in high encoding costs, a common pain point for INR-based methods.
  • Limitations of hash mapping: Random projections may fail to effectively capture correlations between adaptation parameters; learnable amortized encoders/decoders (Vector ↔ LoRA) are a clear direction for improvement.
  • Personalized editing bias: Modifying prompts might introduce statistical biases (e.g., racial associations) from the training data, requiring better disentanglement methods.

Related Work

  • INR Compression: Works like NVRC use small MLPs to parameterize signals; this work replaces the "network" with adaptation parameters on a large model, inheriting INR's functional advantages while introducing pretrained priors.
  • LoRA Personalization: Methods such as DreamBooth and Custom Diffusion adapt a pretrained diffusion model to a single concept, and LoRA is widely used for this kind of customization; this work finds that the same mechanism is an effective compression tool, revealing a deep unification between generation and compression.
  • Diff-C / Relative Entropy Coding: The inference-time scaling algorithm is equivalent to Diff-C, using the adapted diffusion model as a stronger prior to reduce coding costs.
  • Amortized Inference: Learning an amortized decoder from vectors to LoRA is a key future direction to simultaneously accelerate encoding and improve compression rates.