Latent Stochastic Interpolants

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=txiGUfI4yF
Code: To be confirmed
Area: Image Generation / Generative Models
Keywords: Stochastic Interpolants, Latent space generation, Continuous-time ELBO, Diffusion bridge, Joint training

TL;DR

This paper proposes Latent Stochastic Interpolants (LSI), which utilizes a single ELBO objective derived from continuous time to bring the Stochastic Interpolants framework into an end-to-end jointly trained latent space. By optimizing the encoder, decoder, and the latent SI generative model together, LSI achieves FIDs comparable to pixel-space SI on ImageNet with significantly lower sampling FLOPs.

Background & Motivation

  • Background: Stochastic Interpolants (SI) is a unified framework for diffusion-like generation that flexibly bridges any two distributions (not limited to Gaussian priors). It constructs an interpolant \(x_t=(1-t)x_0+tx_1+\sqrt{t(1-t)}\,\epsilon\), learns the corresponding velocity field and score, and trains them efficiently with a simulation-free objective.
  • Limitations of Prior Work: SI requires both the prior \(p_0\) and the target \(p_1\) to be fixed and directly observable. This restricts it to the observation space. To learn a generative model in a low-dimensional latent space, the target distribution becomes the aggregated posterior \(p_1(z_1)=\int p_\theta(z_1|x_1)\,p(x_1)\,dx_1\), where \(p(x_1)\) is the data distribution. This target evolves with the encoder/decoder and is unobservable, so one cannot directly construct a latent interpolant that satisfies the SI marginal constraints.
  • Key Challenge: Running SI directly in a high-dimensional observation space is computationally expensive. Attempting to use a latent space to reduce cost often fails because the posterior is "dynamic and unobservable." Existing latent diffusion models often revert to simple Gaussian priors or rely on ad-hoc multi-stage training (pre-training an autoencoder before the generator), which can leave the latent representation misaligned with the generative process.
  • Goal: To jointly learn the encoder, decoder, and SI generative model in a continuous-time latent space end-to-end, retaining the flexibility of SI for arbitrary priors and simulation-free training while benefiting from the efficiency of a low-dimensional latent space.
  • Core Idea: [Deriving Interpolants from ELBO, not vice versa] Instead of defining an interpolant first and solving for the velocity field as in SI, the paper treats latent variables as continuous-time dynamic variables following an SDE. It formulates a continuous-time ELBO and uses a diffusion bridge (Doob h-transform) to construct a variational posterior that allows simulation-free sampling. This naturally derives the latent stochastic interpolant \(z_t\), resulting in a unified objective.

Method

Overall Architecture

LSI models generation as follows: prior \(z_0\sim p_0\) → latent SDE drift \(h_\theta\) evolves to \(z_1\) → decoder \(p_\theta(x_1|z_1)\) outputs the image. For training, sampling from the posterior \(p_\theta(z_t|x_1)\) is required. The authors construct a variational posterior using "encoder-provided \(z_1\) + a diffusion bridge connecting \(z_0\) and \(z_1\)." Under the assumption of a linear SDE, \(z_t\) can be sampled directly without simulation. The three components (E/D/L) are jointly optimized via a single ELBO.

flowchart LR
    X[Observation x1] -->|Encoder pθ z1 given x1| Z1[Latent z1, t=1]
    P0[Prior p0 samples z0, t=0] --> Bridge
    Z1 --> Bridge[Diffusion Bridge: Direct sampling of zt]
    Bridge --> ZT[Latent Interpolant zt]
    ZT -->|Learn drift hθ zt t| Drift[Latent SI Model L]
    Drift --> ELBO[Continuous-time ELBO Jointly Optimizes E/D/L]
    Z1 -.Reconstruction.-> Dec[Decoder pθ x1 given z1] --> ELBO
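
The generation path described above (prior sample, latent SDE, single decode) can be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation: `generate`, `toy_drift`, `toy_decoder`, the matrix `W`, and the fixed-step Euler-Maruyama integrator are all assumptions made for the example. A real model would use a learned drift \(h_\theta\) and a learned decoder.

```python
import numpy as np

def generate(latent_drift, decoder, z0, sigma=1.0, n_steps=100, rng=None):
    """Integrate dz = h(z, t) dt + sigma dw from t=0 to t=1, then decode once."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.array(z0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        noise = rng.standard_normal(z.shape)
        # Euler-Maruyama step: one latent-model evaluation per step.
        z = z + latent_drift(z, t) * dt + sigma * np.sqrt(dt) * noise
    return decoder(z)  # the decoder runs a single time, on z_1

# Hypothetical stand-ins for the learned networks.
calls = {"latent": 0, "decoder": 0}
target = np.array([2.0, -1.0])
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 2-d latent -> 3-d "observation"

def toy_drift(z, t):
    calls["latent"] += 1
    return target - z  # pulls the latent towards a fixed point

def toy_decoder(z):
    calls["decoder"] += 1
    return z @ W.T

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 2))  # prior samples z_0 ~ p_0 (standard Gaussian here)
x1 = generate(toy_drift, toy_decoder, z0, sigma=0.5, n_steps=50, rng=rng)
print(x1.shape, calls)
```

The call counter makes the cost structure visible: the latent model is evaluated once per integration step, the decoder once per batch, and the encoder not at all.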

Key Designs

1. Continuous-time ELBO: Casting Latent Generation as KL Control of Path Measures
The foundation of the method is the evidence lower bound written for "continuous-time dynamic latent variable" models. Given a model path measure \(P_\theta\) (with drift \(h_\theta\)) and a variational posterior path measure \(Q\) (with drift \(h_\phi\) and shared diffusion \(\sigma\)), the ELBO is defined as: \(\ln p_\theta(x_1)\ge \mathbb{E}_Q[\ln p_\theta(x_1|z_1)] - \mathrm{KL}(Q\|P_\theta)\). By Girsanov's theorem, the KL term simplifies to an expected path integral \(\mathbb{E}_Q\big[\tfrac12\int_0^T\|u(z_t,t)\|^2dt\big]\), where \(\sigma u = h_\phi - h_\theta\). This term penalizes the mismatch between the "variational dynamics" and the "model dynamics" as a differentiable objective, while the first term represents VAE-style reconstruction. This continuous-time form allows arbitrary priors, likelihood control, and simulation-free training to coexist.
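
To make the KL term concrete, the sketch below estimates \(\mathbb{E}_Q[\tfrac12\int_0^1\|u\|^2dt]\) by simulating paths under \(Q\) and compares the result with a closed form. The setup is an assumption chosen only because it is analytically tractable: a scalar latent, \(T=1\), an Ornstein-Uhlenbeck variational drift \(h_\phi(z)=-\theta z\), and a zero model drift \(h_\theta=0\). The function name `path_kl_estimate` is hypothetical.

```python
import numpy as np

def path_kl_estimate(h_phi, h_theta, sigma, z0, n_steps=200, rng=None):
    """Monte Carlo estimate of E_Q[ 1/2 * int_0^1 |u|^2 dt ] with sigma * u = h_phi - h_theta."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.array(z0, dtype=float)
    dt = 1.0 / n_steps
    integral = np.zeros(z.shape[0])
    for i in range(n_steps):
        t = i * dt
        u = (h_phi(z, t) - h_theta(z, t)) / sigma
        integral += 0.5 * np.sum(u**2, axis=-1) * dt      # left Riemann sum along the path
        # Paths are simulated under Q, i.e. with the variational drift h_phi.
        z = z + h_phi(z, t) * dt + sigma * np.sqrt(dt) * rng.standard_normal(z.shape)
    return integral.mean()

theta, sigma, v0 = 1.0, 1.0, 1.0
rng = np.random.default_rng(0)
z0 = np.sqrt(v0) * rng.standard_normal((20000, 1))

kl_mc = path_kl_estimate(
    h_phi=lambda z, t: -theta * z,
    h_theta=lambda z, t: np.zeros_like(z),
    sigma=sigma, z0=z0, rng=rng,
)

# Closed form: E[z_t^2] = v0 * exp(-2*theta*t) + sigma^2/(2*theta) * (1 - exp(-2*theta*t)).
a = (1.0 - np.exp(-2.0 * theta)) / (2.0 * theta)          # int_0^1 exp(-2*theta*t) dt
second_moment_integral = v0 * a + sigma**2 / (2.0 * theta) * (1.0 - a)
kl_exact = theta**2 / (2.0 * sigma**2) * second_moment_integral
print(kl_mc, kl_exact)
```

In training, this simulation is exactly what the diffusion-bridge construction avoids; the example only illustrates what quantity the KL term measures.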

2. Diffusion Bridge for Simulation-free Variational Posterior: Bypassing SDE Numerical Simulation
The challenge is that the ELBO requires sampling \(z_t\sim p_\theta(z_t|x_1)\). Using an arbitrary \(h_\phi\) would require numerical integration at every training step, which is prohibitively expensive. The authors instead explicitly construct the drift: the encoder provides \(z_1\sim p_\theta(z_1|x_1)\), and a diffusion bridge via Doob’s h-transform \(dz_t=[h_\phi+\sigma\sigma^\top\nabla_{z_t}\ln p(z_1|z_t)]dt+\sigma dw_t\) connects the prior \(p_0(z_0)\) with the aggregate posterior at the \(t=1\) endpoint. By assuming a linear SDE \(dz_t=h_t z_t dt+\sigma_t dw_t\), the transition density becomes Gaussian, yielding a closed-form \(\nabla_{z_t}\ln p(z_1|z_t)\). Consequently, the bridge conditional density \(p(z_t|z_1,z_0)\) is also Gaussian, allowing one-step direct sampling of \(z_t\), recovering the simulation-free efficiency of observation-space diffusion.
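
The claim that the bridge can be sampled in one step can be checked numerically in the simplest linear case. The sketch below assumes \(h_t=0\) and constant \(\sigma\) (a Brownian bridge), for which \(\sigma^2\nabla_{z_t}\ln p(z_1|z_t)=(z_1-z_t)/(1-t)\) and \(p(z_t|z_1,z_0)=\mathcal{N}\big((1-t)z_0+tz_1,\ \sigma^2t(1-t)\big)\). It simulates the h-transformed SDE and compares the empirical moments with the closed form. The function names are hypothetical.

```python
import numpy as np

def simulate_bridge(z0, z1, sigma, t_end, n_steps, n_paths, rng):
    """Euler-Maruyama simulation of dz = (z1 - z)/(1 - t) dt + sigma dw on [0, t_end], t_end < 1."""
    z = np.full(n_paths, float(z0))
    dt = t_end / n_steps
    for i in range(n_steps):
        t = i * dt
        drift = (z1 - z) / (1.0 - t)  # Doob h-transform drift for zero base drift
        z = z + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    return z

def bridge_moments(z0, z1, sigma, t):
    """Closed-form mean and variance of the Brownian bridge at time t."""
    return (1.0 - t) * z0 + t * z1, sigma**2 * t * (1.0 - t)

rng = np.random.default_rng(0)
z0, z1, sigma, t = -1.0, 2.0, 0.8, 0.5

z_sim = simulate_bridge(z0, z1, sigma, t_end=t, n_steps=500, n_paths=20000, rng=rng)
mean_cf, var_cf = bridge_moments(z0, z1, sigma, t)

# One-step sampling: no time loop is needed once the moments are known.
z_direct = mean_cf + np.sqrt(var_cf) * rng.standard_normal(20000)
print(z_sim.mean(), z_direct.mean(), mean_cf)
print(z_sim.var(), z_direct.var(), var_cf)
```

The simulated and directly sampled populations agree up to Monte Carlo and discretization error, which is the property that lets training skip the time loop.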

3. Latent Stochastic Interpolants: Reparameterizing \(z_t\) from a Gaussian Bridge
Using the Gaussian bridge described above, \(z_t\) is reparameterized as \(z_t=\eta_t\epsilon+\kappa_t z_1+\nu_t z_0,\ \epsilon\sim\mathcal{N}(0,I)\), where the coefficients satisfy the endpoint constraints \(\kappa_0=\nu_1=0,\ \kappa_1=\nu_0=1,\ \eta_0=\eta_1=0\). This is effectively the latent space version of SI interpolants. The authors choose \(\kappa_t,\nu_t\) first and then derive \(h_t,\sigma_t\). Setting \(\kappa_t=t,\nu_t=1-t\) results in constant diffusion \(\sigma_t=\sigma\), and the interpolant simplifies to \(z_t=\sigma\sqrt{t(1-t)}\,\epsilon+t z_1+(1-t)z_0\). If the prior is chosen as a standard Gaussian, it simplifies further. When the encoder/decoder are identity mappings, LSI reduces exactly to observation-space SI.
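
A minimal sketch of the reparameterized sample for the choice \(\kappa_t=t,\ \nu_t=1-t\) is given below. The function names and the toy arrays are assumptions for illustration; in particular, `z1` here is random data standing in for encoder outputs.

```python
import numpy as np

def lsi_coefficients(t, sigma):
    """Coefficients (eta_t, kappa_t, nu_t) for the choice kappa_t = t, nu_t = 1 - t."""
    t = np.asarray(t, dtype=float)
    eta = sigma * np.sqrt(t * (1.0 - t))
    return eta, t, 1.0 - t

def sample_interpolant(z0, z1, t, sigma, rng):
    """Draw z_t = eta_t * eps + kappa_t * z1 + nu_t * z0 with eps ~ N(0, I)."""
    eta, kappa, nu = lsi_coefficients(t, sigma)
    eps = rng.standard_normal(np.shape(z1))
    return eta * eps + kappa * z1 + nu * z0, eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 4))        # prior samples
z1 = 3.0 + rng.standard_normal((8, 4))  # stand-in for encoder samples
t = rng.uniform(size=(8, 1))            # one time per example, broadcast over features

z_t, eps = sample_interpolant(z0, z1, t, sigma=1.0, rng=rng)

# Endpoint constraints: the noise vanishes and z_t hits z0 at t=0 and z1 at t=1.
at_start, _ = sample_interpolant(z0, z1, 0.0, sigma=1.0, rng=rng)
at_end, _ = sample_interpolant(z0, z1, 1.0, sigma=1.0, rng=rng)
print(np.allclose(at_start, z0), np.allclose(at_end, z1))
```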

4. InterpFlow Parameterization to Stabilize Training Variance
Substituting \(u(z_t,t)\) back into the ELBO leads to a naive loss containing \(\sqrt{1-t}\) in the denominator, resulting in gradient variance explosion. The authors adopt the InterpFlow parameterization \(\tfrac{\beta_t}{2}\big\|-\sigma\sqrt{t}\,\epsilon+\sqrt{1-t}(z_1-z_0)+\sqrt{t}\,z_t-\hat h_\theta(z_t,t)\big\|^2\) and use variable substitution \(t(s)=1-(1-s)^c\) to make the time weight \(\beta_t=\beta/(1-t)\) a constant \(\beta\). The weight \(\beta\) acts similarly to \(\beta\) in a \(\beta\)-VAE: as \(\beta\to0\), the model approximates a fixed pre-trained autoencoder, while larger \(\beta\) values allow the encoder to adjust the representation for the generative objective. For sampling, the authors utilize the equivalent SDE family from Singh & Fischer (2024), enabling the adjustment of stochasticity via \(\gamma_t\) without retraining.
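
The loss expression quoted above can be written out directly. The sketch below evaluates it on random toy data and checks one property that follows from the interpolant \(z_t=\sigma\sqrt{t(1-t)}\,\epsilon+tz_1+(1-t)z_0\): the data-dependent part \(-\sigma\sqrt{t}\,\epsilon+\sqrt{1-t}(z_1-z_0)\) equals \(\sqrt{1-t}\) times the Brownian-bridge drift \((z_1-z_t)/(1-t)\), so the regression target stays bounded as \(t\to1\). The function name, the zero-output placeholder network, and treating \(\beta_t\) as a given scalar are assumptions for illustration.

```python
import numpy as np

def interpflow_loss(h_hat, z0, z1, eps, t, sigma, beta_t):
    """Per-example loss 0.5 * beta_t * || target + sqrt(t) * z_t - h_hat(z_t, t) ||^2."""
    z_t = sigma * np.sqrt(t * (1.0 - t)) * eps + t * z1 + (1.0 - t) * z0
    target = -sigma * np.sqrt(t) * eps + np.sqrt(1.0 - t) * (z1 - z0)
    residual = target + np.sqrt(t) * z_t - h_hat(z_t, t)
    return 0.5 * beta_t * np.sum(residual**2, axis=-1)

rng = np.random.default_rng(0)
n, d, sigma = 16, 3, 0.7
z0 = rng.standard_normal((n, d))
z1 = rng.standard_normal((n, d))
eps = rng.standard_normal((n, d))
t = rng.uniform(0.05, 0.95, size=(n, 1))

# Check: the target is the bridge drift rescaled by sqrt(1 - t).
z_t = sigma * np.sqrt(t * (1.0 - t)) * eps + t * z1 + (1.0 - t) * z0
target = -sigma * np.sqrt(t) * eps + np.sqrt(1.0 - t) * (z1 - z0)
scaled_bridge_drift = np.sqrt(1.0 - t) * (z1 - z_t) / (1.0 - t)
print(np.allclose(target, scaled_bridge_drift))

# Placeholder "network" that always outputs zero (hypothetical, untrained).
zero_net = lambda z, t: np.zeros_like(z)
loss = interpflow_loss(zero_net, z0, z1, eps, t, sigma, beta_t=1.0)
print(loss.shape)
```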

Key Experimental Results

Main Results

Class-conditional generation on ImageNet (FID @ 2000 epochs). Comparison between latent LSI and observation-space SI (Parameters in M / FLOPs per forward pass in G; E/D/L represent Encoder/Decoder/Latent model):

| Resolution | Latent FID | Obs. FID | Params Latent, M (E/D/L) | Params Obs., M | FLOPs Latent, G (E/D/L) | FLOPs Obs., G |
|---|---|---|---|---|---|---|
| 64×64 | 2.62 | 2.57 | 392 (5/5/382) | 398 | 15/15/161 | 201 |
| 128×128 | 3.12 | 3.46 | 392 (5/5/382) | 400 | 59/59/327 | 466 |
| 256×256 | 3.91 | 3.87 | 393 (5/5/383) | 405 | 240/240/450 | 1288 |

LSI achieves FIDs comparable to observation-space SI across resolutions. The key advantage lies in sampling efficiency: because the encoder is not used during sampling and the decoder is run only once, whereas the latent model L runs at every step, the FLOP savings accumulate during multi-step sampling. At 128×128 with 100-step sampling, LSI saves 73.6% FLOPs; at 256×256, it saves 48.6%.

Ablation Study

Capacity Transfer (128×128): Moving \(k\) convolutional blocks from the latent model L to the encoder/decoder while keeping total parameters constant significantly reduces sampling FLOPs. A comparison between joint training (\(\beta>0\)) and independent training (\(\beta\to0\)) follows:

| k | FID (\(\beta>0\)) | FID (\(\beta\to0\)) | Params, M (E/D/L) | FLOPs, G (E/D/L) |
|---|---|---|---|---|
| 0 | 3.76 | 4.31 | 392 (5/5/382) | 59/59/327 |
| 3 | 3.91 | 4.55 | 389 (9/8/372) | 68/66/313 |
| 6 | 3.96 | 4.87 | 387 (13/12/362) | 75/73/299 |
| 9 | 4.61 | 4.98 | 383 (16/16/351) | 82/80/284 |

Joint training consistently performs better and shows slower FID degradation as capacity is shifted.

Key Findings

  • Joint Training Gains: FID improved from 4.53 (\(\beta\to0\)) to 3.75 (\(\beta=0.0001\)), roughly a 17% gain. Allowing the encoder to adapt latent representations for generative goals is effective.
  • Encoder Noise Scale \(c\) is Crucial: Deterministic encoders (\(c=0\)) performed worst. Fixing \(c\) outperformed learning it.
  • Unification: With identity mappings for the encoder/decoder, LSI reduces exactly to observation-space SI, validating it as a strict generalization.

Highlights & Insights

  • Perspective Shift: While SI "defines the interpolant and solves the velocity field," LSI "defines the ELBO and derives the interpolant via a diffusion bridge." This inversion is the key to bringing SI into an unobservable latent space.
  • Unified Single Objective: The encoder, decoder, and latent generative model are optimized jointly via a single continuous-time ELBO, replacing the multi-stage pipeline and aligning the latent representation with the generative process.
  • Efficiency through Structure: By concentrating computation in a lightweight latent model L and omitting the encoder during sampling, savings grow linearly with the number of sampling steps.
  • Interpretability of \(\beta\): \(\beta\) serves as a clear knob on a spectrum between using a pre-trained autoencoder and joint adaptation, providing a clear intuition for tuning.

Limitations & Future Work

  • Linear SDE Assumption: Simulation-free sampling depends on the assumption of a linear \(h_\phi\) and additive noise, which theoretically limits the expressivity of the variational posterior.
  • Single Observation Moment: While the ELBO supports multiple observations \(x_{t_i}\), the experiments only use \(t=1\), leaving the potential for time-series data unexplored.
  • Narrow Evaluation Scope: The experiments focus on ImageNet class-conditional generation without covering text-to-image, super-resolution, or video tasks.
  • Variance Sensitivity: The naive ELBO loss is high-variance and necessitates the InterpFlow parameterization and specific time-variable substitutions.

Related Work & Insights

  • Stochastic Interpolants (Albergo et al., 2023): The direct predecessor; LSI generalizes it to the latent space.
  • Continuous-time ELBO / Latent SDE: Provides the lower bound formulation for dynamic latents.
  • Diffusion Bridges: The core tool for constructing simulation-free posterior sampling.
  • Latent Diffusion: Also performs latent generation but typically uses multi-stage training and simpler priors; LSI is end-to-end and supports arbitrary priors.
  • Insight: When the target distribution is unobservable or evolves during training, "deriving the forward process from a variational lower bound" is a more general path than "manually constructing a fixed interpolant."

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Successfully brings the SI framework into a jointly trainable latent space using a principled continuous-time ELBO approach.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Solid results on ImageNet with detailed ablations on capacity transfer and hyper-parameters, though limited to few tasks.
  • Writing Quality: ⭐⭐⭐⭐ — Rigorous derivations and clear motivations, though technically demanding.
  • Value: ⭐⭐⭐⭐ — Offers a principled foundation for "latent + arbitrary prior + end-to-end" generation with practical benefits in sampling efficiency.