ReFrame: Layer Caching for Accelerated Inference in Real-Time Rendering
Conference: ICML 2025
arXiv: 2506.13814
Code: https://ubc-aamodt-group.github.io/reframe-layer-caching/
Area: Image Generation
Keywords: Layer Caching, Real-Time Rendering, U-Net, Temporal Consistency, Inference Acceleration
TL;DR
Extends intermediate-layer caching (DeepCache) from diffusion models to the U-Net/U-Net++ networks used in real-time rendering pipelines, achieving an average 1.4× inference speedup with negligible image-quality degradation through a frame-difference adaptive caching strategy.
Background & Motivation
Background: Real-time rendering (e.g., DLSS 4.0) heavily relies on U-Net-style neural networks for tasks such as denoising, super-resolution, and frame extrapolation, with network inference accounting for a significant portion of the rendering pipeline.
Limitations of Prior Work: (a) High temporal correlation exists between rendered frames, yet full inference is performed for every frame; (b) methods utilizing inter-frame differences such as DeltaCNN are difficult to accelerate on current GPUs due to sparse computation; (c) DeepCache is designed solely for multi-step inference in diffusion models.
Key Challenge: In rendering, every frame must individually yield high-quality outputs (unlike diffusion models, which can tolerate approximations in intermediate steps), even though inter-frame features change slowly.
Goal: Exploit temporal redundancy between frames while meeting the strict per-frame quality constraints of real-time rendering.
Key Insight: Cache deep intermediate features of U-Net to skip most of the encoder-decoder computation, recomputing only the shallow layers (which are sensitive to changes in new inputs).
Core Idea: Slow temporal variations in features during real-time rendering \(\rightarrow\) cache deep features \(\rightarrow\) adaptive refresh strategy \(\rightarrow\) training-free acceleration.
Method
Overall Architecture
In U-Net, the deep feature \(C_t\) is cached during full inference; subsequent frames only compute the first layer \(X^0\) and the final layer \(X^n\), replacing intermediate layers with \(C_t\). For U-Net++, all skip connection branches except the first layer are cached.
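
The caching flow described above can be sketched as a small wrapper. This is a minimal illustration for a plain U-Net only, not the authors' implementation: `CachedUNet`, `shallow_enc` (standing in for \(X^0\)), `deep_path` (everything between \(X^0\) and \(X^n\), whose output is \(C_t\)), and `final_dec` (standing in for \(X^n\)) are hypothetical names, and the toy usage runs on scalars rather than feature maps.

```python
from typing import Any, Callable, Optional


class CachedUNet:
    """Toy U-Net wrapper that reuses a cached deep feature C_t.

    All three callables are hypothetical stand-ins for real network stages.
    """

    def __init__(
        self,
        shallow_enc: Callable[[Any], Any],     # X^0: first layer
        deep_path: Callable[[Any], Any],       # all intermediate layers
        final_dec: Callable[[Any, Any], Any],  # X^n: combines C_t and the skip
    ) -> None:
        self.shallow_enc = shallow_enc
        self.deep_path = deep_path
        self.final_dec = final_dec
        self.cache: Optional[Any] = None       # C_t from the last full inference

    def forward(self, frame: Any, refresh: bool) -> Any:
        skip = self.shallow_enc(frame)         # always recomputed on the new input
        if refresh or self.cache is None:
            self.cache = self.deep_path(skip)  # full inference refreshes C_t
        return self.final_dec(self.cache, skip)


# Toy usage with scalar "features": the deep path runs once for two frames.
deep_calls = []


def toy_deep_path(skip):
    deep_calls.append(skip)
    return skip * 10


net = CachedUNet(lambda f: f + 1, toy_deep_path, lambda c, s: c + s)
out_full = net.forward(1, refresh=True)     # full inference, fills the cache
out_cached = net.forward(2, refresh=False)  # reuses C_t from the first frame
```

The `refresh` flag is left to the caller so that any of the refresh strategies can drive the same wrapper.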
Key Designs
- Layer Caching Mechanism:
    - Function: Skip deep computation of the encoder-decoder
    - Mechanism: Cache \(C_t = X^{n-1}(\text{concat}(\ldots))\), and for subsequent frames, compute \(O = X^n(\text{concat}(C_t, X^0(I)))\)
    - Design Motivation: Deep features capture high-level semantics and vary the slowest across frames, while shallow features capture low-level details and are most sensitive to new inputs
- Frame-Difference Adaptive Strategy (Frame Deltas):
    - Function: Determine whether to refresh the cache based on the degree of change in the input
    - Mechanism: Calculate the SMAPE between the current input and the cached frame, refreshing the cache if it exceeds a threshold \(\tau\). Two tiers are provided: high sensitivity (Delta_H) and low sensitivity (Delta_L)
    - Design Motivation: A fixed Every-N strategy cannot adapt to unpredictable scene changes in rendering (e.g., rapid camera movement vs. static scenes)
- Motion Vector Threshold Strategy:
    - Function: Utilize existing motion vectors from the rendering pipeline to determine whether to refresh
    - Mechanism: Refresh the cache when the average motion exceeds a threshold \(\tau\)
    - Design Motivation: No additional storage overhead (motion vectors are existing by-products of the rendering pipeline)
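
The two adaptive refresh rules above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: `smape`, `FrameDeltaPolicy`, and `motion_refresh` are hypothetical names, the SMAPE variant shown (normalised by \(|a| + |b|\), so it lies in \([0, 1]\)) is one common definition and may differ from the one the authors use, and motion vectors are assumed to be an `(H, W, 2)` array.

```python
import numpy as np


def smape(a, b, eps=1e-8):
    """Symmetric mean absolute percentage error (assumed variant, range [0, 1])."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean(np.abs(a - b) / (np.abs(a) + np.abs(b) + eps)))


class FrameDeltaPolicy:
    """Refresh when the input drifts too far from the frame seen at the last refresh."""

    def __init__(self, tau):
        self.tau = tau   # smaller tau = more sensitive (more refreshes)
        self.ref = None  # input frame stored at the last cache refresh

    def should_refresh(self, frame):
        if self.ref is None or smape(frame, self.ref) > self.tau:
            # Keep a copy so later in-place edits to `frame` cannot alter the reference.
            self.ref = np.array(frame, dtype=np.float64, copy=True)
            return True
        return False


def motion_refresh(motion_vectors, tau):
    """Refresh when the mean per-pixel motion magnitude exceeds tau."""
    mv = np.asarray(motion_vectors, dtype=np.float64)  # assumed shape (H, W, 2)
    return float(np.mean(np.linalg.norm(mv, axis=-1))) > tau
```

Note that the reference frame is only updated on a refresh, so slow drift accumulates against the cached frame rather than against the previous frame. Threshold values for Delta_H and Delta_L are not given here and would need to be tuned per task.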
Loss & Training
- Fully training-free: Network weights are not modified; caching logic is added solely during inference.
- Can be combined with orthogonal techniques such as quantization and pruning.
Key Experimental Results
Main Results
| Task | Scene | Strategy | Speedup ↑ | FLIP ↓ | SSIM ↑ |
|---|---|---|---|---|---|
| Frame Extrapolation | Sun Temple | Delta_H | 1.42× | 0.017 | 0.994 |
| Frame Extrapolation | Sun Temple | Delta_L | 1.72× | 0.033 | 0.984 |
| Super-Resolution | Sun Temple | Delta_H | 1.30× | 0.049 | 0.970 |
| Super-Resolution | Sun Temple | Delta_L | 1.85× | 0.118 | 0.930 |
| Image Synthesis | Garden Chair | Delta_H | 1.05× | 0.001 | 1.000 |
Ablation Study
| Strategy | Avg. Frame Skip Rate | Avg. Speedup | FLIP (qualitative) | Description |
|---|---|---|---|---|
| Every-2 | 50% | ~1.4× | Medium | Fixed interval |
| Every-4 | 75% | ~1.7× | High | Quality drops drastically during fast motion |
| Delta_H | 30-50% | 1.1-1.4× | Lowest | Adaptive, preserving quality |
| Delta_L | 60-80% | 1.5-1.9× | Low | Adaptive balance |
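
As a rough consistency check on the two fixed-interval rows, consider a simple cost model (an assumption made here for illustration, not taken from the paper): let \(s\) be the fraction of frames served from the cache and \(c\) the cost of a cached frame relative to full inference. The average speedup is then

\[
\text{speedup}(s) = \frac{1}{(1 - s) + s\,c} .
\]

Plugging in the Every-2 row (\(s = 0.5\), speedup \(\approx 1.4\)) gives \(c \approx 0.43\). With the same \(c\), Every-4 (\(s = 0.75\)) yields

\[
\frac{1}{0.25 + 0.75 \times 0.43} \approx 1.75 ,
\]

which is close to the reported ~1.7×. Under this model the speedup cannot exceed \(1/c \approx 2.3\times\) however many frames are skipped, because the first and last layers are recomputed on every frame.
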
Key Findings
- FLIP < 0.05 is considered an acceptable quality loss in the rendering domain (reference values range from 0.05 to 0.28).
- Frame extrapolation tasks benefit the most (as continuous frame transitions are smoothest), followed by super-resolution.
- Adaptive strategies prevent the drastic image quality drop that fixed strategies experience during fast camera movements.
Highlights & Insights
- Training-free + Universality: No network retraining is required; it can be applied to any encoder-decoder network possessing skip connections.
- Extension to U-Net++: Layer caching is extended from U-Net to U-Net++ for the first time.
- Reinvesting Saved Computation into Rendering: Saved inference time can be spent on a higher ray-tracing sample rate, which may improve overall rendering quality.
Related Work & Insights
- vs DeepCache: DeepCache targets the fixed-step inference of diffusion models, while ReFrame addresses the single-frame output requirement of rendering; adaptive refreshing is the key difference.
- vs DeltaCNN: DeltaCNN utilizes pixel-level differences to sparsify computations, but existing GPU hardware struggles to accelerate sparse operations; the layer caching in ReFrame is fully compatible with existing hardware.
- vs DLSS: DLSS natively utilizes U-Net, and ReFrame can serve as a component to further accelerate it.
- The caching strategy and DeltaCNN can theoretically be combined: cached frames only compute shallow layers, which are then sparsified internally using deltas.
Limitations & Future Work
- Evaluated on an RTX 2080 Ti; verification of benefits on the latest GPUs (e.g., RTX 4090/5090) is lacking.
- The thresholds for the adaptive strategy require task-specific tuning, and no automated way to choose them is provided.
- Only three rendering tasks and five scenes were evaluated, which limits the scope.
- Handling of rapid scene cuts (e.g., teleportation in games) is not discussed.
- Challenges regarding cache consistency in multiplayer synchronized rendering are not considered.
Rating
- Novelty: ⭐⭐⭐ The transfer of DeepCache to rendering is relatively direct; the adaptive strategy is an incremental contribution.
- Experimental Thoroughness: ⭐⭐⭐ The number of tasks and scenes is relatively small.
- Writing Quality: ⭐⭐⭐⭐ Clear and systematic, with rich illustrations.
- Value: ⭐⭐⭐⭐ Holds practical significance for optimizing rendering pipelines.