ObCLIP: Oblivious Cloud-Device Hybrid Image Generation with Privacy Preservation¶
Conference: NeurIPS 2025
arXiv: 2510.04153
Code: Unavailable
Area: Diffusion Models / Privacy-Preserving Image Generation
Keywords: Privacy preservation, hybrid inference, oblivious generation, attention caching, diffusion models
TL;DR¶
ObCLIP is proposed as an oblivious cloud-device hybrid image generation scheme. It expands a user prompt into a set of candidate prompts that differ only in sensitive attributes (e.g., gender, race), performs the early denoising steps on all candidates in the cloud without revealing the true prompt, and lets the client select the correct intermediate latent and complete the remaining denoising locally. Temporal and batch redundancy acceleration techniques cut the extra cost of processing multiple candidates, making ObCLIP 4.4–7.6× faster than vanilla oblivious generation.
Background & Motivation¶
Text-to-image generation services (e.g., Midjourney, DALL·E) face two core challenges:
Prompt privacy leakage: User-submitted prompts may contain sensitive attributes (gender, age, race) that the server can directly access. Even without exposing the prompt directly, the server can perform full image generation from received text embeddings, revealing sensitive visual features.
High server-side cost: As model scale grows (Scaling Law), computational costs increase dramatically.
Existing approaches each suffer from serious limitations:

- Cryptographic methods (MPC, homomorphic encryption): Provide strong security guarantees but incur enormous computational overhead (HE-Diffusion exceeds \(10^6\times\) overhead), making them impractical.
- Differential privacy perturbation (SANTEXT, etc.): Adds noise to prompts, inevitably causing semantic loss and degraded generation quality.
- On-device models (SnapFusion, MobileDiffusion): Avoid data transmission but yield significantly lower image quality.
- Hybrid generation (Hybrid SD): Reduces server overhead but does not protect prompt privacy; text embeddings are sent directly to the server and can be recovered via embedding inversion attacks.
A key empirical finding motivates the design: the initial denoising steps constitute a semantic planning phase critical for global semantic information. If the initial steps use candidate prompts, more than 80% of the subsequent steps with the true prompt are needed to correct the semantic drift. Thus, naive prompt substitution is insufficient.
Method¶
Overall Architecture¶
ObCLIP forms a complete pipeline with three stages:

1. Oblivious transformation: The true prompt \(p^*\) is expanded into a set of \(N\) candidate prompts \(\mathcal{P}\), differing only in sensitive attribute values.
2. Cloud-side partial denoising: The server runs a large model to perform the first \(k\) denoising steps (where \(k\) is a hyperparameter) on all \(N\) candidates.
3. Client-side extraction: The client selects the intermediate latent corresponding to the true prompt and completes the remaining denoising using a small local model.
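A minimal toy sketch of this pipeline in Python may help fix ideas. The attribute value spaces, the template placeholders, and both "denoise" functions below are illustrative stand-ins of my own, not the paper's models; only the control flow (expand → cloud batch for the first \(k\) steps → client selects and finishes) mirrors the description above.

```python
import itertools
import numpy as np

# Hypothetical sensitive-attribute value spaces (illustrative only).
ATTRIBUTES = {
    "age": ["young", "middle-aged", "elderly"],
    "race": ["African", "Asian", "Caucasian"],
    "gender": ["man", "woman"],
}

def expand_candidates(template: str) -> list[str]:
    """Oblivious transformation: enumerate all attribute combinations.

    `template` holds one placeholder per attribute, e.g.
    "portrait of {age} {race} {gender}". The true prompt hides among
    N = prod(|value space|) candidates.
    """
    keys = list(ATTRIBUTES)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in itertools.product(*(ATTRIBUTES[k] for k in keys))
    ]

def cloud_partial_denoise(latents, k):
    """Stand-in for the server's large model running the first k steps
    on all N candidate latents as one batch."""
    for _ in range(k):
        latents = latents * 0.9  # dummy update in place of a U-Net step
    return latents

def device_finish(latent, total_steps, k):
    """Stand-in for the small on-device model finishing the rest."""
    for _ in range(total_steps - k):
        latent = latent * 0.95
    return latent

candidates = expand_candidates("portrait of {age} {race} {gender}")
N = len(candidates)  # 3 age x 3 race x 2 gender = 18 candidates
true_idx = candidates.index("portrait of young African woman")

rng = np.random.default_rng(0)
latents = rng.standard_normal((N, 4, 8, 8))   # toy latent batch
latents = cloud_partial_denoise(latents, k=10)
image_latent = device_finish(latents[true_idx], total_steps=50, k=10)
```

Note that `true_idx` exists only on the client; the server sees all \(N\) candidates symmetrically, which is the source of the \(1/N\) indistinguishability bound.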
Key Designs¶
- Oblivious Generation: The security guarantee is grounded in the indistinguishability of candidate prompts. Theorem 1 proves that any probabilistic polynomial-time (PPT) adversary, given only \(\mathcal{P}\), can identify the true prompt \(p^*\) with probability no greater than \(1/N + \lambda\) (where \(\lambda\) is negligible). Candidate prompts are constructed by identifying sensitive attributes and enumerating their value spaces, e.g., expanding "portrait of young African woman" into all combinations of age × race × gender.
- Batch Redundancy Acceleration: Since candidate prompts differ only in sensitive attributes, they share global semantics. Visualizing cross-attention and self-attention maps confirms that global features such as background and gestures are highly similar across candidates. Therefore, the attention map is computed once for a single pivot prompt and broadcast to all candidates: \(m^* = \text{AttnMap}(q^*, k^*), \quad M = \text{broadcast}(m^*), \quad O = M \cdot V\). This substantially reduces computation in `to_q`, `to_k`, and Softmax.
- Temporal Redundancy Acceleration: Two strategies are employed:
  - Attention caching: Inspired by T-Gate, self-attention contributions become negligible after the first \(r\) steps and can be skipped. Cross-attention maps stabilize after steps 2–3 and are cached, with a refresh every 5 steps.
  - Block skipping: Intermediate block outputs change minimally after steps 2–3. After the skip point \(s\), only the UpBlock is computed: \(z_t = \begin{cases} (\text{DownBlock} \circ \text{MidBlock} \circ \text{UpBlock})(z_{t-1}, \mathcal{P}, t) & t < s \\ \text{UpBlock}(z_{t-1}, f_{mid}, \mathcal{P}, t) & t \geq s \end{cases}\)
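The batch-reuse idea, computing \(m^*\) once for a pivot and broadcasting it over the whole candidate batch, can be sketched with NumPy. This is a toy single-head attention; the shapes and the `pivot` argument are my own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def batched_attention_with_reuse(q, k, v, pivot=0):
    """Compute the attention map once for the pivot candidate and
    broadcast it to the batch: M = broadcast(m*), O = M . V.

    q, k, v: (N, seq, d) arrays for N candidate prompts. The to_q/to_k
    projections and Softmax are skipped for all non-pivot candidates.
    """
    d = q.shape[-1]
    m_star = softmax(q[pivot] @ k[pivot].T / np.sqrt(d))       # (seq, seq)
    M = np.broadcast_to(m_star, (v.shape[0],) + m_star.shape)  # (N, seq, seq)
    return M @ v                                               # (N, seq, d)

rng = np.random.default_rng(0)
N, seq, d = 4, 16, 8
q, k, v = (rng.standard_normal((N, seq, d)) for _ in range(3))
out = batched_attention_with_reuse(q, k, v)
```

Each candidate still uses its own value tensor \(V\), so attribute-specific content survives; only the (near-identical) attention maps are shared.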
Hyperparameter Control¶
Three key hyperparameters govern the efficiency–quality trade-off:

- Switch point \(k\): the number of denoising steps executed in the cloud; larger \(k\) yields higher quality but higher cost.
- Cache point \(r\): the step at which attention-map caching begins.
- Skip point \(s\): the step at which DownBlock+MidBlock skipping begins.
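One way to read the three hyperparameters together is as a per-step schedule. The sketch below encodes the policies described above (skip self-attention after \(r\), cache cross-attention with a 5-step refresh, skip DownBlock+MidBlock after \(s\)); the default values and field names are my own illustrative choices.

```python
def step_plan(t, k=10, r=3, s=3):
    """Sketch of how the switch point k, cache point r, and skip
    point s partition denoising step t (0-indexed). Illustrative only."""
    return {
        # Steps before k run on the cloud's large model, the rest on-device.
        "runs_on": "cloud" if t < k else "device",
        # Self-attention contributes little after the first r steps.
        "self_attention": t < r,
        # Cross-attention maps are cached, refreshed every 5 steps.
        "cross_attention": "compute" if (t < r or t % 5 == 0) else "cached",
        # After skip point s, only the UpBlock is evaluated.
        "blocks": "full" if t < s else "upblock_only",
    }
```

For instance, under the defaults an early step (`t=1`) computes everything on the cloud, while a late step (`t=12`) runs on-device with cached cross-attention and only the UpBlock.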
Key Experimental Results¶
Main Results: Candidate Prompt Dataset (Realistic Vision v4.0 + small-sd)¶
| Method | FID ↓ | IS ↑ | CLIP ↑ | Latency (s) | Notes |
|---|---|---|---|---|---|
| Realistic Vision (no privacy) | 113.45 | 4.69 | 0.3322 | 1.12 | Baseline |
| small-sd (on-device only) | 128.87 | 5.04 | 0.3051 | 0.78 | Low quality |
| Vanilla OG (oblivious+full cloud, N=2) | 113.45 | 4.69 | 0.3322 | 2.51 | 2× latency |
| HE-Diffusion | - | - | - | >\(10^6\) | Infeasible |
| Hybrid SD (k=10) | 117.18 | 4.96 | 0.3215 | 0.55 | No privacy |
| ObCLIP (k=10, +cache+reuse) | 114.26 | 4.82 | 0.3167 | 0.57 | Near Hybrid SD |
MS-COCO 30K Dataset (SD-v1.4 + BK-SDM-small)¶
| Method | FID ↓ | IS ↑ | CLIP ↑ | FLOPs (T) |
|---|---|---|---|---|
| SD-v1.4 (full model) | 13.86 | 37.75 | 0.3015 | 18.53 |
| BK-SDM-small | 18.30 | 31.73 | 0.2710 | 10.90 |
| ObCLIP (k=10, +cache) | 15.73 | 33.62 | 0.2865 | 5.84* |
| ObCLIP (k=5, +cache) | 16.45 | 33.36 | 0.2833 | 3.06* |
Ablation Study¶
| Configuration | FID (N=6) | Latency (s) | Notes |
|---|---|---|---|
| ObCLIP (k=10, no acceleration) | 114.05 | 2.90 | Base oblivious+hybrid |
| + Temporal caching | 115.65 | 1.85 | 36% latency reduction |
| + Batch reuse | 109.76 | 1.55 | 47% latency reduction, FID improves |
Key Findings¶
- Privacy protection is nearly free: At N=2, ObCLIP latency (0.57s) is almost identical to Hybrid SD without privacy protection (0.55s).
- Batch reuse improves quality: Reusing attention maps across candidate prompts not only reduces computation but also lowers FID from 114.05 to 109.76, possibly because the shared global semantics suppress noise tied to sensitive attributes.
- Orders of magnitude faster than cryptographic methods: More than \(10^6\times\) faster than HE-Diffusion and 4.4–7.6× faster than vanilla oblivious generation.
- Effective on SDXL: On the SDXL+Koala-700m combination, ObCLIP (k=10) achieves FID=30.79, close to SDXL's 30.67, while reducing FLOPs to 45.11T (vs. 159.35T for SDXL).
Highlights & Insights¶
- Oblivious security paradigm: Unlike encryption or perturbation, this approach achieves indistinguishability-based security (Theorem 1 bounds any PPT adversary at \(1/N + \lambda\)) by having the server process real and dummy prompts simultaneously, which is conceptually clean and principled.
- Empirical finding that "initial steps determine semantics": Only 20% of server-side steps are needed to capture the semantic planning capacity of large models, providing a theoretical basis for hybrid inference.
- Batch redundancy as a unique acceleration dimension: This acceleration mode is intrinsic to oblivious generation—the existence of multiple semantically similar candidate prompts is precisely what enables attention map reuse across them.
- Parameter \(k\) enables flexible quality–cost trade-offs: Users can adjust \(k\) to control generation quality under privacy protection based on their requirements.
Limitations & Future Work¶
- Identification of sensitive attributes and construction of candidate sets rely on rule-based methods and pretrained classifiers, which may miss certain privacy-sensitive information.
- The number of candidates \(N\) grows exponentially with the number of sensitive attributes (reaching 50+ for 3 attributes), causing rapid cost escalation.
- The semi-honest threat model is relatively weak—security is not guaranteed if the adversary deviates from the protocol.
- Privacy of image outputs is not considered (generated images may leak sensitive information).
- Batch reuse relies on the assumption that attention maps are similar across candidate prompts, which may fail when sensitive attributes substantially alter global semantics.
Related Work & Insights¶
- Hybrid SD first proposed cloud-device hybrid diffusion generation but did not address privacy—this work adds an oblivious layer on top of that framework.
- T-Gate's insight that initial steps serve as semantic planning directly motivates ObCLIP's allocation strategy and caching design.
- DeepCache's finding of temporal redundancy in U-Net intermediate blocks informs the block-skipping optimization.
- SANTEXT/CAPE DP perturbation methods perform reasonably in NLP but produce unacceptable semantic loss in text-to-image tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐ The oblivious generation paradigm is distinctive, though the core acceleration techniques are largely combinations of existing methods.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive validation across multiple models and datasets with detailed comparisons of both latency and FLOPs.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated; the two research questions are addressed in a well-structured empirical manner.
- Value: ⭐⭐⭐⭐ Addresses a practical privacy–efficiency–quality trilemma with real-world significance for image generation services.