
The Universal Normal Embedding

Conference: CVPR 2026
arXiv: 2603.21786
Code: https://github.com/ (open-sourced, with the NoiseZoo dataset)
Area: Diffusion Models / Representation Learning
Keywords: Latent Space Gaussianity, Generative-Encoder Unification, DDIM Inversion, Linear Semantic Editing, Representation Geometry

TL;DR

This paper proposes the Universal Normal Embedding (UNE) hypothesis: the latent spaces of generative models (diffusion models) and visual encoders (CLIP, DINO) share an approximately Gaussian underlying geometric structure, and both can be viewed as noisy linear projections of this shared space. The hypothesis is validated through the NoiseZoo dataset and extensive experiments, and the paper demonstrates the feasibility of direct linear semantic editing in the DDIM inversion noise space.

Background & Motivation

Background: Generative models (VAE, GAN, diffusion models) and visual encoders (CLIP, DINO) have generally evolved along independent technical trajectories — the former optimizing image synthesis quality, the latter optimizing semantic representation capacity. However, prior work has identified two intriguing phenomena: (1) models within the same family can have their latent spaces "stitched" together via simple linear mappings; and (2) encoders across different architectures and modalities exhibit linearly alignable representations.

Limitations of Prior Work: Although theoretical frameworks such as the Platonic Representation Hypothesis predict that different models converge toward a shared latent description, they do not specify the geometric structure of that shared space. In practice, semantic editing in diffusion models relies on text prompts, architectural modifications, or additional fine-tuning, lacking an approach grounded directly in latent space geometry.

Key Challenge: Encoder latent spaces are known to be linearly separable in semantics (linear probes suffice for classification), but it remains unclear whether the noise space of generative models possesses an equivalent semantic structure. If both spaces truly originate from the same underlying space, linear semantic operations should be directly applicable in the noise space — yet this has not been systematically verified.

Goal: (1) Formalize the "shared Gaussian latent space" hypothesis and provide empirical support; (2) verify whether DDIM inversion noise from diffusion models encodes semantic information comparable to that of encoders; (3) demonstrate the feasibility of direct linear semantic editing in the noise space.

Key Insight: Generative models sample from Gaussian noise to produce images, while encoders map images to embeddings that are empirically approximately Gaussian — these two directions are in fact two "perspectives" on the same Gaussian latent space. The authors formalize this as the Induced Normal Embedding (INE): each model's latent space is a noisy linear projection of the ideal UNE.

Core Idea: Encoder embeddings and DDIM inversion noise are both linear projections of the same underlying Gaussian latent space; consequently, linear probes can discover semantic directions in the noise space and enable controllable editing directly.
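The core idea can be sketched numerically. The toy below (my own construction, not the authors' code) samples an ideal Gaussian latent \(Z\), defines a linearly separable attribute in it, and builds two noisy linear projections standing in for an encoder space and a noise space; a simple linear probe then reaches high accuracy in both, as the INE view predicts. The `ine` and `probe_accuracy` helpers and all dimensions are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 16                       # samples, dimension of the ideal UNE

# Ideal shared Gaussian latent space (the hypothesised UNE)
Z = rng.standard_normal((n, d))
v = rng.standard_normal(d)            # ground-truth semantic direction
y = Z @ v > 0                         # attribute linearly separable in the UNE

def ine(Z, out_dim, noise=0.05):
    """Induced Normal Embedding: a noisy linear projection of the UNE."""
    C = rng.standard_normal((Z.shape[1], out_dim)) / np.sqrt(Z.shape[1])
    return Z @ C + noise * rng.standard_normal((Z.shape[0], out_dim))

Z_enc = ine(Z, 32)   # stand-in for an encoder embedding space
Z_gen = ine(Z, 64)   # stand-in for a DDIM-inversion noise space

def probe_accuracy(X, y, split=3000):
    """Least-squares linear probe (a cheap surrogate for logistic regression)."""
    Xtr, Xte, ytr, yte = X[:split], X[split:], y[:split], y[split:]
    w, *_ = np.linalg.lstsq(Xtr, 2.0 * ytr - 1, rcond=None)
    return np.mean((Xte @ w > 0) == yte)

print(probe_accuracy(Z_enc, y))   # high accuracy in both INEs
print(probe_accuracy(Z_gen, y))
```

Because each INE is (up to noise) an invertible linear image of the UNE, the separating hyperplane survives the projection — which is exactly the property the paper then tests on real encoder embeddings and inversion noise.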

Method

Overall Architecture

Rather than proposing a new network architecture, this paper introduces a theoretical hypothesis (UNE) and validates it through systematic experiments. The core pipeline is: (1) construct the NoiseZoo dataset — extract encoder embeddings and DDIM inversion noise from multiple models for CelebA images; (2) verify the Gaussianity of each model's latent space; (3) train linear probes to test semantic separability; (4) test alignment via cross-space linear mapping; (5) perform semantic editing along linear probe directions; (6) recover a multi-model shared subspace via GCCA.

Key Designs

  1. Formalization of the UNE Hypothesis:

    • Function: Provide a unified understanding of the relationship between encoder and generative model latent spaces.
    • Mechanism: Assumes the existence of an ideal multivariate standard normal latent space \(Z \sim \mathcal{N}(0, I)\) (UNE), where each model \(i\)'s latent space is a noisy linear projection \(\hat{Z}_i = C_i Z + \epsilon_i\) (termed Induced Normal Embedding, INE). When noise-free and \(C_i\) is invertible, semantics that are linearly separable in the UNE remain linearly separable in the INE. Semantics are consistently preserved across all models along directions shared by multiple INEs.
    • Design Motivation: Advances the intuition of "models converging to shared representations" toward a concrete geometric constraint — Gaussianity implies that semantic variation corresponds to linear directions, making linear probing and linear editing natural operations.
  2. NoiseZoo Dataset Construction:

    • Function: Provide paired latent representations of the same images across multiple models.
    • Mechanism: Uses approximately 19k images from the CelebA validation set, extracting embeddings (500–1k dimensions) from 5 encoders (CLIP ViT-B/16, CLIP ViT-L/14, OpenCLIP ViT-B/16, OpenCLIP ViT-L/14, DINOv3) and DDIM inversion noise (~16k dimensions) from 3 diffusion models (SD 1.5, SD 2.1, LCMv7) for each image. The train/test split is 15k/4k.
    • Design Motivation: Paired data is the foundation for verifying cross-model alignment. Prior work compared only within the same model family; NoiseZoo is the first to systematically place encoders and generative models within a unified framework.
  3. Linear Semantic Editing and Orthogonal Disentanglement:

    • Function: Perform controllable attribute editing along linear directions in the DDIM inversion noise space.
    • Mechanism: A linear classifier (logistic regression) identifies the direction \(w\) corresponding to an attribute in the noise space; the editing operation is \(\tilde{z} = z + \alpha w\), where \(\alpha\) controls editing strength. When attributes are entangled (e.g., modifying facial hair also affecting face shape), the target direction is projected onto the orthogonal complement of the interfering direction \(w_2\) via orthogonalization: \(\tilde{w}_1 = w_1 - \frac{w_2 w_2^\top}{w_2^\top w_2} w_1\), achieving disentangled editing.
    • Design Motivation: If the UNE hypothesis holds, linear directions correspond to semantic changes — no prompts, no fine-tuning, and no architectural modifications are needed; simple vector arithmetic suffices. Orthogonalization is a natural disentanglement tool in Gaussian spaces.
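The edit rule and the orthogonalization step reduce to a few lines of vector arithmetic. The sketch below uses random vectors as stand-ins for real probe directions; the final renormalization back to the original noise norm is a common trick when editing DDIM latents and is my assumption, not something the paper states.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16384                    # ~4x64x64, the SD noise-latent size

z  = rng.standard_normal(dim)  # DDIM-inversion noise of one image (stand-in)
w1 = rng.standard_normal(dim)  # probe direction for the target attribute
w2 = rng.standard_normal(dim)  # probe direction for an entangled attribute

# Project w1 onto the orthogonal complement of w2 (the paper's
# disentanglement step): w1_perp = w1 - (w2 w2^T / w2^T w2) w1
w1_perp = w1 - (w2 @ w1) / (w2 @ w2) * w2
assert abs(w1_perp @ w2) < 1e-6   # no longer moves along the entangled direction

# Edit: move along the disentangled direction with strength alpha, then
# renormalise so the result stays near the Gaussian shell (assumption, see above).
alpha = 3.0
z_edit = z + alpha * w1_perp / np.linalg.norm(w1_perp)
z_edit *= np.linalg.norm(z) / np.linalg.norm(z_edit)
```

Running the edited noise back through the (non-inverted) DDIM sampler would then produce the attribute-edited image.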

Loss & Training

This paper does not involve training new networks. Linear probes are trained using standard logistic regression (L2 regularization). Cross-space mappings use ridge regression. Shared subspace recovery uses the MAXVAR formulation of GCCA (Generalized CCA), which admits a closed-form solution.
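The MAXVAR closed form mentioned above can be sketched in a few lines: the shared variable \(G\) is given by the top-\(k\) left singular vectors of the concatenated per-view orthonormal bases. This is my own minimal implementation of standard MAXVAR GCCA, not the authors' code, verified on a toy where three noisy linear views share one Gaussian latent.

```python
import numpy as np

def gcca_maxvar(views, k):
    """MAXVAR GCCA: minimise sum_i ||G - X_i W_i||^2 s.t. G^T G = I.
    Closed form: G = top-k left singular vectors of the horizontally
    stacked orthonormal bases Q_i of the (centred) views."""
    views = [X - X.mean(0) for X in views]      # centre each view
    Qs = [np.linalg.qr(X)[0] for X in views]    # orthonormal basis per view
    U, s, _ = np.linalg.svd(np.hstack(Qs), full_matrices=False)
    return U[:, :k]                             # shared subspace, shape (n, k)

# Toy check: three noisy linear views of one shared Gaussian latent Z.
rng = np.random.default_rng(2)
n, d, k = 2000, 8, 4
Z = rng.standard_normal((n, d))
views = [Z @ rng.standard_normal((d, m)) + 0.1 * rng.standard_normal((n, m))
         for m in (20, 30, 40)]
G = gcca_maxvar(views, k)

# If GCCA found the shared structure, G should be (almost) a linear
# function of Z; measure this with the least-squares R^2.
B, *_ = np.linalg.lstsq(Z, G, rcond=None)
r2 = 1 - np.sum((G - Z @ B) ** 2) / np.sum(G ** 2)
print(round(r2, 3))   # close to 1.0
```

The eigen-decomposition of \(\sum_i Q_i Q_i^\top\) used here is equivalent to the SVD of the stacked bases, which is why no iterative optimization is needed.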

Key Experimental Results

Main Results

Gaussianity test (1D random projection, Anderson-Darling pass rate):

Model AD Pass Rate ↑ Type
SD 1.5 96.00% Generative
SD 2.1 95.80% Generative
LCMv7 95.58% Generative
CLIP B16 89.50% Encoder
CLIP L14 91.90% Encoder
DINOv3 84.48% Encoder
Bimodal Gaussian (control) 15.88% Non-Gaussian
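The pass-rate metric in the table can be reproduced in miniature: project the data onto random 1D directions and count how often each projection passes an Anderson-Darling normality test at the 5% level. This sketch assumes SciPy is available (for `'norm'`, `critical_values[2]` is the 5% threshold); the Gaussian and bimodal inputs are synthetic stand-ins for the paper's latents and control.

```python
import numpy as np
from scipy.stats import anderson

def ad_pass_rate(X, n_proj=200, alpha_idx=2, seed=0):
    """Fraction of random 1D projections of X passing an Anderson-Darling
    normality test at the 5% level (critical_values[2] for dist='norm')."""
    rng = np.random.default_rng(seed)
    passes = 0
    for _ in range(n_proj):
        u = rng.standard_normal(X.shape[1])
        u /= np.linalg.norm(u)
        res = anderson(X @ u, dist='norm')
        passes += res.statistic < res.critical_values[alpha_idx]
    return passes / n_proj

rng = np.random.default_rng(0)
gauss   = rng.standard_normal((1000, 16))                      # Gaussian latent
means   = np.where(rng.random(1000) < 0.5, -3.0, 3.0)[:, None]
bimodal = means + rng.standard_normal((1000, 2))               # bimodal control

print(ad_pass_rate(gauss))    # near the ~95% expected for true Gaussians
print(ad_pass_rate(bimodal))  # far lower, like the paper's control row
```

Note that a truly Gaussian sample is expected to pass only ~95% of the time at a 5% significance level, which is why 96% for SD 1.5 sits at the effective ceiling.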

Accuracy drop after cross-space linear mapping:

Generative → Encoder Cosine Similarity Accuracy Drop
SD 1.5 → CLIP B16 0.80 0.20 pp
SD 2.1 → CLIP B16 0.80 0.14 pp
LCM → CLIP B16 0.81 0.00 pp
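The "accuracy drop" protocol behind this table can be sketched as follows: train a probe in the encoder space, fit a closed-form ridge map from the generative space into the encoder space, and compare the probe's accuracy on directly-encoded versus mapped latents. Everything below is a synthetic stand-in (my construction, not the released code); the ridge penalty `lam` is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 3000, 16
Z = rng.standard_normal((n, d))                     # shared UNE (toy)
y = Z[:, 0] > 0                                     # a binary attribute

# Two INEs: stand-ins for an encoder space and a generative noise space
X_enc = Z @ rng.standard_normal((d, 32)) + 0.05 * rng.standard_normal((n, 32))
X_gen = Z @ rng.standard_normal((d, 64)) + 0.05 * rng.standard_normal((n, 64))

tr = slice(0, 2000); te = slice(2000, None)

# Probe trained in the encoder space (least-squares surrogate for logistic reg.)
w, *_ = np.linalg.lstsq(X_enc[tr], 2.0 * y[tr] - 1, rcond=None)
acc_direct = np.mean((X_enc[te] @ w > 0) == y[te])

# Closed-form ridge map generative -> encoder: M = (Xg^T Xg + lam I)^-1 Xg^T Xe
lam = 1e-2
Xg = X_gen[tr]
M = np.linalg.solve(Xg.T @ Xg + lam * np.eye(Xg.shape[1]), Xg.T @ X_enc[tr])
acc_mapped = np.mean((X_gen[te] @ M @ w > 0) == y[te])

print(acc_direct, acc_mapped)   # near-identical: a tiny accuracy drop
```

A sub-percentage-point drop after such a purely linear map is the paper's evidence that the two families of spaces are linearly aligned.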

Ablation Study

Shared subspace classification (16-dim PCA vs. shared space):

Space 16-dim Classification Accuracy Notes
CLIP B16 (PCA-16d) ~79% Single model, low-dim
SD 1.5 (PCA-16d) ~77% Single model, low-dim
Shared Space X1 (16d) ~78% 4-model intersection
Shared Space X5 (16d) ~77% 6-model intersection

Key Findings

  • Diffusion model noise spaces exhibit extremely strong Gaussianity: SD 1.5 achieves an AD pass rate of 96%, essentially the ~95% pass rate expected of truly Gaussian data at the 5% significance level. Encoders also score 84–92%, far above the non-Gaussian control.
  • Noise spaces encode rich, linearly separable semantics: across 40 CelebA attributes, linear-probe accuracy on DDIM inversion noise is highly correlated with that of CLIP, with near per-attribute parity.
  • Cross-space linear mapping error is minimal: After linear mapping from generative models to encoders, classification accuracy drops by less than 0.3 percentage points, confirming that the two types of spaces are linearly aligned.
  • Low-dimensional shared subspaces retain substantial attribute information: A shared subspace of only 16 dimensions achieves classification performance close to single-model PCA-16d.
  • Linear editing in the noise space produces naturally smooth results (smile, gender, age, etc.), and orthogonalization effectively eliminates attribute entanglement.

Highlights & Insights

  • "Generation and encoding are two sides of the same coin": This conceptual insight is remarkably elegant. Once the UNE hypothesis is accepted, many empirical observations about cross-model alignment acquire a unified explanatory framework. This perspective can inform the design of future foundation models that simultaneously possess understanding and generation capabilities.
  • Research value of the NoiseZoo dataset: The combination of paired encoder embeddings and diffusion noise constitutes a unique research resource that can catalyze a large body of subsequent work on latent space geometry analysis.
  • Semantic editing requiring no additional training: Controllable editing (modifying smile, age, gender, etc.) is achieved purely through vector addition in the noise space, with simple and effective orthogonal disentanglement. This is considerably more straightforward than existing prompt engineering or model fine-tuning approaches.
  • An "executable hypothesis" distinct from pure theory: UNE is not merely a high-level conjecture; it immediately yields testable predictions (Gaussianity, linear separability, cross-model alignment, low-dimensional shared subspaces), all of which are experimentally confirmed.

Limitations & Future Work

  • Experiments are validated only on CelebA face data and have not been extended to more diverse domains such as natural scenes (ImageNet) or medical images — the generality of UNE remains to be more broadly examined.
  • Only 3 models from the Stable Diffusion family are used; the hypothesis has not been verified for other generative architectures (e.g., DALL-E 3, Flux, Consistency Models).
  • The shared subspace is recovered via GCCA but is not compared against stronger nonlinear alignment methods (the authors intentionally restrict themselves to linear methods to validate the hypothesis, though nonlinear methods may perform better in practice).
  • DDIM inversion noise is extremely high-dimensional (~16k), posing significant storage and computational costs in practical applications.
  • Gaussianity is slightly lower for encoders (especially DINOv3 at only 84%); whether systematic deviations exist warrants deeper analysis.
  • Quantitative evaluation of semantic editing (FID, LPIPS, attribute accuracy, etc.) is insufficiently comprehensive.

Comparison with Related Work

  • vs. Platonic Representation Hypothesis: PRH proposes the broad conjecture that models converge to a shared representation, but does not specify a geometric structure. UNE explicitly identifies Gaussianity as the key geometric constraint and unifies both encoder and generative model families.
  • vs. latent space linear stitching works (LIT, Model Stitching): These works demonstrate linear alignment within the same model family; UNE's contribution is to extend alignment across families (encoders ↔ generators).
  • vs. StyleGAN latent space editing: Although StyleGAN's W/W+ space also supports linear editing, diffusion models lack a persistent latent code. UNE demonstrates that DDIM inversion noise naturally possesses analogous linear semantic structure.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Unifying encoders and generative models under a shared Gaussian latent space hypothesis represents a highly original conceptual contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage of multi-model Gaussianity tests, cross-space mapping, linear editing, and shared subspace experiments, though the dataset is limited to CelebA.
  • Writing Quality: ⭐⭐⭐⭐⭐ The narrative from hypothesis to theory to experiment is fluent and logically coherent, with well-crafted figures and tables.
  • Value: ⭐⭐⭐⭐⭐ Offers a unified perspective with potentially far-reaching implications for representation learning and generative modeling.