Multi-Scale Local Speculative Decoding for Image Generation¶
Conference: CVPR 2026
arXiv: 2601.05149
Code: https://qualcomm-ai-research.github.io/mulo-sd-webpage/ (Project Page)
Area: Image Generation
Keywords: Autoregressive image generation, speculative decoding, multi-scale, spatial locality, parallel decoding
TL;DR¶
MuLo-SD introduces a multi-scale approach—"low-resolution draft + upsampling + high-resolution parallel verification"—into speculative decoding. By replacing traditional raster-scan full-sequence backtracking with "local neighborhood resampling of rejected tokens" combined with parallel decoding, it achieves up to \(5.33\times\) end-to-end acceleration for autoregressive image generation on Tar-1.5B/7B while maintaining semantic alignment and perceptual quality.
Background & Motivation¶
Background: Unified multimodal large language models (MLLMs) integrate language and vision understanding/generation into a single autoregressive (AR, next-token prediction) framework, demonstrating stronger text-to-image alignment than diffusion models. However, AR decoding is inherently serial, and the token count grows quadratically with image side length (reaching thousands of tokens at 1024p), making latency a critical bottleneck.
Limitations of Prior Work: Existing acceleration methods for AR image generation have drawbacks. ① Speculative Decoding (SD), successful in text (2–3× speedup), largely fails for images because next-token distributions are "flat" and acceptance rates are low; EAGLE-2 even causes a net slowdown (speedup below 1×) on images. LANTERN relaxes the acceptance criterion using "codebook neighbor probability pooling + TVD constraints" (approx. 1.6–1.8× speedup), but it still operates strictly at the token level in raster-scan order: once a token is rejected, the entire subsequent sequence is discarded, so it fails to exploit 2D spatial structure or multi-scale priors. ② Next-scale prediction (VAR family) is fast but uses custom sampling schedules that are incompatible with the next-token framework and unified MLLMs (it cannot utilize the KV-cache).
Key Challenge: Images are inherently 2D, multi-scale, and spatially local, yet current image speculative decoding treats them as 1D token streams. This ignores natural "coarse-to-fine" hierarchies and spatial correlations where nearby tokens are strongly related. A single rejection currently invalidates the entire suffix.
Goal: Inject multi-scale structure and spatial locality into speculative decoding to improve acceptance rates and reduce serial steps without abandoning the next-token prediction objective (preserving compatibility with unified MLLMs).
Key Insight: AR image models often release checkpoints for multiple resolutions (256p/512p/1024p). A low-resolution model can serve as a drafter (cheap, short sequences), while the high-resolution model acts as the verifier. Building on the fact that visual AR attention is local (as shown in ZipAR/LPD), resampling is performed only within a small neighborhood of the rejected token.
Core Idea: Replace "same-resolution token-by-token drafting + raster-scan backtracking" with "low-resolution drafting → upsampling → high-resolution parallel verification → local neighborhood resampling" to fully exploit multi-scale and spatial locality.
Method¶
Overall Architecture¶
MuLo-SD takes a text prompt and generates an image token sequence of length \(N\) at the target resolution \(s_p\) (512p or 1024p). It repeats a "draft–upsample–verify–local backtrack" cycle until \(|x|=N\).
Let the target model \(M_p\) operate at high resolution \(s_p\) and the drafter \(M_q\) at low resolution \(s_q\), with ratio \(r=s_p/s_q\) (e.g., \(r=2\) for 512p, \(r=4\) for 1024p). In one cycle: ① \(M_q\) serially samples a batch of low-res draft tokens \(\tilde{y}\) (in full rows, for consistency); ② an upsampler \(U_r\) maps \(\tilde{y}\) to a high-res draft \(\tilde{x}=U_r(\tilde{y})\), expanding the sequence length by a factor of \(r^2\); ③ \(M_p\) verifies \(\tilde{x}\) in parallel using neighborhood-pooled thresholding; ④ rejected tokens undergo local expansion and are resampled in parallel by \(M_p\), grouped into "rejection islands"; ⑤ accepted tokens are concatenated to the prefix \(x\), which is then downsampled, \(y=D_r(x)\), to serve as the next low-res prefix.
The pipeline contains two serial bottlenecks, low-res drafting (Step ①) and high-res resampling (Step ④), both of which are mitigated by parallel decoding (ZipAR-style). Notably, the drafter and verifier have equal capacity; the speedup comes from fewer function evaluations (NFE) and the quadratic gap in sequence length between resolutions.
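The quadratic gap in sequence length can be made concrete with a little arithmetic. The sketch below assumes a square token grid and a hypothetical tokenizer that yields a 32×32 grid at 512p and a 64×64 grid at 1024p; these grid sizes and the helper name `token_counts` are illustrative assumptions, not values taken from the paper.

```python
def token_counts(target_side_tokens: int, r: int) -> dict:
    """Compare draft and target sequence lengths for a square token grid.

    target_side_tokens: tokens per row/column at the target resolution
                        (hypothetical; depends on the tokenizer).
    r: resolution ratio s_p / s_q between verifier and drafter.
    """
    if target_side_tokens % r != 0:
        raise ValueError("target grid side must be divisible by r")
    draft_side = target_side_tokens // r
    n_target = target_side_tokens ** 2   # N, the high-res sequence length
    n_draft = draft_side ** 2            # low-res sequence length
    return {
        "target_tokens": n_target,
        "draft_tokens": n_draft,
        "expansion": n_target // n_draft,            # equals r**2
        # one full low-res row upsamples to r full high-res rows
        "target_tokens_per_draft_row": r * target_side_tokens,
    }


if __name__ == "__main__":
    print(token_counts(32, 2))  # assumed 512p grid, r = 2
    print(token_counts(64, 4))  # assumed 1024p grid, r = 4
```

Under these assumed grid sizes, the drafter samples 256 tokens in both settings, while the verifier checks 1024 or 4096 upsampled tokens, which is why the potential gain grows with \(r\).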
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Text Prompt"] --> B["Multi-Scale Draft<br/>Low-res Mq Sampling<br/>+ Upsampling Ur (r² expansion)"]
B --> C["High-res Mp Parallel Verification<br/>Neighborhood pooling threshold τ"]
C -->|"Rejected tokens RT"| D["Local Verification & Expansion<br/>Radius l neighborhood resampling"]
C -->|"Accepted tokens"| F["Prefix Concatenation + Downsampling Dr"]
D --> E["Parallel Backtracking<br/>Concurrent resampling via 8-connected islands"]
E --> F
F -->|"|x|<N (Return to draft)"| B
F -->|"|x|=N"| G["Output image token sequence"]
Key Designs¶
1. Multi-Scale Drafting: Coarse-to-fine priors via low-res drafter + upsampler. Unlike traditional SD, where the drafter is just a smaller model at the same resolution, \(M_q\) runs at a significantly lower resolution \(s_q\). Low-res sequences are shorter and cheaper to sample. The upsampler \(U_r\) maps \(\tilde{y}\) to \(\tilde{x}=U_r(\tilde{y})\), expanding the length by a factor of \(r^2\). Drafting is performed in full rows to ensure spatial coherence for the upsampler. Two upsampler versions are provided: latent-space (masked row-causal conv + pixel-shuffle, requires training) and pixel-space (decode to pixels → off-the-shelf SR → re-encode, training-free, default).
2. Local Verification and Neighborhood Expansion: Confining rejection impact. This is the core innovation. Raster-scan rules discard everything after the first rejection, yielding near-zero speedup under this scheme. MuLo-SD uses a relaxed threshold: a draft token is accepted if its pooled probability over a codebook neighborhood \(B_k(\tilde{x}_i)\) is at least \(\tau\): $$\text{Accept if } \sum_{x\in B_k(\tilde{x}_i)} p_i(x) \ge \tau$$ For the set of rejected tokens \(R_T\), resampling only the rejected token itself is insufficient, because its surrounding context is also affected. Local expansion is therefore introduced: for each \(t\in R_T\), a 2D neighborhood of radius \(l\) is defined as \(N(t,l)=\{u \mid |i_u-i_t|\le l,\ |j_u-j_t|\le l,\ u\ge t_0\}\). The union \(R_X=\bigcup_{t\in R_T}N(t,l)\) is resampled. This exploits the spatial locality of visual AR attention, maintaining high quality without a full suffix flush.
3. Parallel Decoding Integration: Flattening bottlenecks with Rejection Islands. This design addresses the remaining serial steps. For low-res drafting, ZipAR is used to parallelize sampling. For high-res resampling, the set \(R_X\) is partitioned into 8-connected components ("rejection islands") \(\{\mathcal{I}_m\}_{m=1}^{M}=\mathrm{CC}_8(R_X)\). Since accepted tokens provide sufficient context, disjoint islands are resampled concurrently.
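Local expansion and rejection islands can be illustrated on a toy token grid. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, tokens are addressed as `(row, col)` cells, and \(t_0\) is interpreted here as the raster index of the first token that may still be rewritten, which is an assumption because the summary does not define it.

```python
from collections import deque


def expand_rejections(rejected, l, height, width, t0=0):
    """Return R_X: the union of radius-l square neighborhoods N(t, l).

    rejected: iterable of (row, col) cells rejected by the verifier (R_T).
    t0: assumed raster index below which tokens are never resampled.
    """
    region = set()
    for i, j in rejected:
        for u_i in range(max(0, i - l), min(height, i + l + 1)):
            for u_j in range(max(0, j - l), min(width, j + l + 1)):
                if u_i * width + u_j >= t0:  # the u >= t0 condition
                    region.add((u_i, u_j))
    return region


def rejection_islands(region):
    """Partition cells into 8-connected components (CC_8) using BFS."""
    remaining = set(region)
    islands = []
    while remaining:
        seed = min(remaining)  # deterministic starting cell
        remaining.remove(seed)
        queue = deque([seed])
        island = [seed]
        while queue:
            i, j = queue.popleft()
            # visit all 8 neighbors, including diagonals
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    nb = (i + di, j + dj)
                    if nb in remaining:
                        remaining.remove(nb)
                        queue.append(nb)
                        island.append(nb)
        islands.append(sorted(island))
    return islands


if __name__ == "__main__":
    far_apart = expand_rejections({(1, 1), (6, 6)}, l=1, height=8, width=8)
    print(len(rejection_islands(far_apart)))  # two separate islands
    close = expand_rejections({(1, 1), (3, 3)}, l=1, height=8, width=8)
    print(len(rejection_islands(close)))      # neighborhoods touch, one island
```

Each island would then be resampled by the verifier independently of the others; islands that merge after expansion have to be treated as one unit, since their tokens share context.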
Loss & Training¶
Only the latent-space upsampler/downsampler requires training (on LAION-COCO-Aesthetic). Perceptual quality improves significantly when shifting from token classification loss to pixel-space reconstruction loss (MSE + LPIPS). High-frequency details are further enhanced using a lightweight PatchGAN discriminator. The default pixel-space upsampler remains training-free and model-agnostic.
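One plausible way to write the combined objective for the latent-space upsampler is shown below. The additive form and the weights \(\lambda_{\text{LPIPS}}\) and \(\lambda_{\text{adv}}\) are assumptions made for illustration; the summary only states which loss terms are used, not how they are weighted. Here \(I\) is the ground-truth image and \(\hat{I}\) is the image decoded from the upsampled tokens.

$$\mathcal{L} = \lVert \hat{I} - I \rVert_2^2 + \lambda_{\text{LPIPS}}\,\mathrm{LPIPS}(\hat{I}, I) + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}(\hat{I})$$

The first two terms correspond to the pixel-space reconstruction loss (MSE + LPIPS), and \(\mathcal{L}_{\text{adv}}\) stands for the adversarial term supplied by the PatchGAN discriminator.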
Key Experimental Results¶
Tests were conducted on Tar-1.5B/7B (unified MLLM) and LlamaGen-XL. Efficiency is measured by speedup (serial latency / MuLo-SD latency) on A100 (batch=1). Metrics include GenEval/DPG-Bench (alignment) and FID/HPSv2 (quality).
Main Results¶
| Base (Res) | Method | Speedup↑ | GenEval↑ | DPG-Bench↑ | FID↓ | HPSv2↑ |
|---|---|---|---|---|---|---|
| LlamaGen-XL (512p) | EAGLE-2 | 0.96× | 37.1 | 65.1 | 56.2 | 23.1 |
| LlamaGen-XL (512p) | LANTERN | 1.59× | 36.3 | 64.5 | 55.4 | 22.2 |
| LlamaGen-XL (512p) | MuLo-SD (2×) | 1.40× | 36.1 | 64.0 | 54.3 | 23.1 |
| Tar-7B (512p) | LANTERN | 1.20× | 84.9 | 80.5 | 36.9 | 28.7 |
| Tar-7B (512p) | MuLo-SD (2×) | 2.03× | 85.1 | 80.8 | 38.2 | 29.5 |
| Tar-7B (1024p) | LANTERN | 1.45× | 82.9 | 80.5 | 34.6 | 29.4 |
| Tar-7B (1024p) | MuLo-SD (4×) | 5.33× | 85.4 | 80.8 | 34.8 | 29.5 |
Key Finding: EAGLE-2 (exact SD) produces a net slowdown (speedup below 1×; 0.96× on LlamaGen-XL), supporting the claim that standard SD is mismatched to visual tokens. MuLo-SD achieves the highest speedup on Tar-7B at both resolutions, and the gain grows with resolution: at 1024p it reaches \(5.33\times\) with a higher GenEval score than LANTERN (85.4 vs. 82.9). On LlamaGen-XL, however, LANTERN is faster (1.59× vs. 1.40×).
Ablation Study¶
| Dimension | Comparison | Conclusion |
|---|---|---|
| Upsampler Loss | Token loss → Pixel MSE+LPIPS → +PatchGAN → SR | Pixel-space loss is vital for quality; GAN loss adds detail; SR is competitive and training-free. |
| Pooling | Draft token only vs. codebook k-nearest pooling | Pooling improves acceptance stability but gains are capped due to overlap with \(\tau\). |
| Local Backtrack | Raster-scan vs. naive local vs. local+expansion | Raster-scan requires low \(\tau\) (poor quality); naive local is worse; local+expansion preserves quality and speed. |
| Parallel Decoding | Serial vs. Parallel resampling | Parallel decoding reduces end-to-end latency without affecting quality metrics. |
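The pooling ablation contrasts two acceptance tests: thresholding the draft token's own probability versus thresholding the probability pooled over its codebook neighborhood \(B_k\). A minimal sketch on a four-entry toy codebook follows; the probabilities, the neighbor table, and the function name `accept_draft` are all hypothetical, and in practice the neighbors would be derived from the tokenizer's codebook.

```python
def accept_draft(probs, draft_token, neighbors, tau):
    """Relaxed acceptance: pooled probability over B_k(draft_token) >= tau.

    probs: verifier probabilities over the codebook at one position.
    neighbors: maps each token to its neighborhood (including itself).
    """
    pooled = sum(probs[t] for t in neighbors[draft_token])
    return pooled >= tau


if __name__ == "__main__":
    probs = [0.1, 0.3, 0.2, 0.4]      # toy verifier distribution
    token_only = {2: [2]}              # no pooling: neighborhood is the token itself
    two_nearest = {2: [2, 1]}          # hypothetical 2-nearest codebook neighbors
    print(accept_draft(probs, 2, token_only, tau=0.4))   # rejected: 0.2 < 0.4
    print(accept_draft(probs, 2, two_nearest, tau=0.4))  # accepted: 0.5 >= 0.4
    print(accept_draft(probs, 2, token_only, tau=0.15))  # accepted after lowering tau
```

The last call hints at why the gains from pooling are capped: lowering \(\tau\) admits many of the same tokens that pooling would, so the two knobs overlap.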
Highlights & Insights¶
- Merging split paths: MuLo-SD reconciles the "multi-scale" (fast but incompatible) and "speculative decoding" (compatible but slow) trajectories. It treats low-res drafts as "coarse" and high-res verification as "fine."
- Rejection Islands: Utilizing 8-connectivity to enable concurrent resampling of disjoint regions is a clever way to exploit spatial locality beyond simple ZipAR.
- Resolution as the speedup driver: In an unconventional move, the drafter and verifier use the same capacity. Speedup is derived from resolution-induced sequence length reduction rather than model size disparity.
- Training-free path: The pixel-space SR path offers a model-agnostic, zero-cost engineering solution.
Limitations & Future Work¶
- Memory Overhead: Requires loading both low-res and high-res checkpoints plus dual KV-caches, which is demanding for memory-constrained devices.
- Distribution Alignment: Effectiveness drops if the drafter and verifier are not well aligned (e.g., only 1.40× on LlamaGen-XL, below LANTERN's 1.59×). Self-speculative decoding (drafting from internal layers) is proposed as a future solution.
- Metrics Caveat: FID comparisons across different \(\tau\) thresholds should be interpreted alongside speedup gains.
Related Work & Insights¶
- vs LANTERN: Both relax acceptance criteria, but MuLo-SD adds multi-scale drafting and local neighborhood expansion, yielding significantly higher speedups on Tar-7B (2.03× vs. 1.20× at 512p; 5.33× vs. 1.45× at 1024p) at comparable quality.
- vs VAR/M-VAR: Shares "coarse-to-fine" philosophy but maintains next-token compatibility for unified MLLMs.
- vs ZipAR/LPD: MuLo-SD uses ZipAR/LPD as building blocks for parallelization during the drafting and resampling phases.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First to combine multi-scale priors with local SD; rejection island mechanism is highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Solid across multiple bases and resolutions; however, limited to images (missing video).
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, well-illustrated, and mathematically sound.
- Value: ⭐⭐⭐⭐⭐ High practical value for unified MLLM deployment, with speedups of up to 5.33×.