Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction
Conference: CVPR 2026 · arXiv: 2512.05597 · Code: None · Area: 3D Vision · Keywords: 3D scene understanding, multi-token prediction, structured language model, inference acceleration, self-speculative decoding
TL;DR
This paper proposes Fast SceneScript, which introduces multi-token prediction (MTP) into structured language models for 3D scene understanding to accelerate inference. Combined with self-speculative decoding (SSD) and confidence-guided decoding (CGD) to filter unreliable tokens, as well as a parameter-efficient head-sharing mechanism, the method achieves 5.09× and 5.14× speedups on layout estimation and object detection respectively without accuracy loss.
Background & Motivation
- Background: Structured language model-based 3D perception methods such as SceneScript represent 3D scenes as token sequences (e.g., `[make_wall, x1, y1, z1, x2, y2, z2, height, thickness]`), enabling a single model architecture to handle multiple tasks including layout estimation, 3D object detection, and coarse-grained reconstruction.
- Limitations of Prior Work:
- Autoregressive next-token prediction (NTP) is slow; inference latency grows with sequence length (e.g., 1176ms on Structured3D);
- Directly applying MTP reduces inference steps but severely degrades accuracy (F1-Score drops from 0.915 to 0.842 with 8 heads);
- MTP introduces \((n-1)\) additional token heads, substantially increasing parameter count (14M→23.67M).
- Key Challenge: The trade-off between MTP speedup and token prediction accuracy, compounded by the overhead of additional parameters.
- Goal: How to achieve multi-fold inference acceleration of structured language models while preserving accuracy with minimal parameter overhead?
- Key Insight: Structured language (vs. natural language) exhibits stronger determinism and weak inter-token coupling, making MTP more feasible; the key challenge is designing reliable token filtering strategies to reject inaccurate predictions.
- Core Idea: Predict multiple tokens via MTP, then filter unreliable tokens through SSD/CGD, retaining only the longest reliable prefix — a "predict aggressively, accept selectively" acceleration paradigm.
Method
Overall Architecture
Input 3D point clouds are encoded into features via a sparse 3D ResNet. A language decoder (Transformer with self/cross-attention) predicts \(n\) future tokens and \((n-1)\) confidence scores in a single step, conditioned on preceding tokens and 3D features. A shared Projection Block and Token Head process the hidden states for each token position. A token filtering stage rejects unreliable tokens, accepting only the longest reliable prefix.
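The single-step, multi-position prediction path described above can be sketched in plain Python. Everything in this sketch is an illustrative assumption: toy dimensions, random weights, and a single matrix reapplied iteratively to stand in for the paper's two-FFN Projection Block. What it does show faithfully is the parameter-sharing idea: one Projection Block and one Token Head serve every predicted position.

```python
import random

# Toy sketch (not the authors' code) of the shared-head idea: one Projection
# Block derives the extra hidden states, and a single Token Head (one weight
# matrix) decodes logits for every position. Dimensions and weights are
# illustrative assumptions.
D, VOCAB, N = 4, 6, 3  # hidden dim, vocab size, tokens predicted per step

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

W_proj = rand_matrix(D, D)      # stands in for the lightweight Projection Block
W_head = rand_matrix(VOCAB, D)  # the ONE Token Head shared by all positions

def predict_step(hidden):
    """Return logits for n future tokens from a single decoder hidden state."""
    states = [hidden]
    for _ in range(N - 1):
        # the same Projection Block is reused to derive each extra state,
        # so no per-position head parameters are added
        states.append(matvec(W_proj, states[-1]))
    # the same Token Head decodes every position
    return [matvec(W_head, s) for s in states]

logits = predict_step([0.1, -0.2, 0.3, 0.4])
print(len(logits), len(logits[0]))  # 3 positions, 6 logits each
```

Swapping in independent `W_head` matrices per position recovers vanilla MTP's linear parameter growth, which is exactly what the shared design avoids.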
Key Designs
- Parameter-Efficient Multi-Token Prediction:
- Function: Predicts \(n\) tokens per inference step, reducing decoding steps from \(N\) to \(\lceil N/n \rceil\).
- Mechanism: Conventional MTP introduces independent token heads for each additional position, causing linear parameter growth. Fast SceneScript shares a single Token Head across all \(n\) heads. A lightweight Projection Block (2 FFN blocks, each with 2 linear layers + ReLU + LayerNorm) maps the language decoder hidden state \(f_{k+1}\) to \(n-1\) distinct hidden states \(f_{k+i}\). The Projection Block is shared across all heads, adding only ~7.5% parameters (vs. 69% for MTP-8).
- Design Motivation: Language model hidden states are context-dependent and reside in a shared semantic space; although hidden states at different positions differ, they can be decoded by the same head. A Transformer FFN-like structure suffices to generate discriminative features.
- Self-Speculative Decoding (SSD):
- Function: Filters unreliable tokens via two-step verification to preserve accuracy.
- Mechanism: Step one uses \(n\) MTP heads to predict candidate tokens \(\{t_{k+1}, ..., t_{k+n}\}\); step two feeds these tokens as prefix input and uses the first (most reliable) head to re-predict \(\{\tilde{t}_{k+2}, ..., \tilde{t}_{k+n}\}\). Only the longest consistent prefix between the two steps is accepted. For numerical tokens, a distance threshold \(|t_{k+i} - \tilde{t}_{k+i}| \leq \tau\) is used instead of strict equality to increase acceptance rate.
- Design Motivation: Small errors in numerical tokens (coordinates, heights) in structured language are acceptable; distance-based matching is more appropriate than the exact matching used in natural language settings.
- Confidence-Guided Decoding (CGD):
- Function: Predicts tokens and confidence scores within the same inference step, enabling immediate filtering.
- Mechanism: Each additional head is paired with a Confidence Head that predicts the probability \(c_{k+i}\) that the token agrees with the first head's output. During training, a BCE loss supervises this: \(\hat{c}_{k+i}=1\) if \(|t_{k+i} - \tilde{t}_{k+i}| \leq \tau\). At inference, tokens with \(c_{k+i} < \epsilon\) are marked as unreliable. Total loss: \(\mathcal{L} = \mathcal{L}_{\text{MTP}} + \lambda_c \mathcal{L}_c\).
- Design Motivation: SSD requires an additional verification step that increases latency. CGD completes prediction and verification within the same step, offering a more elegant single-step solution at the cost of training an additional Confidence Head.
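Both filtering rules reduce to "keep the longest reliable prefix." A minimal sketch, assuming integer tokens; the helper names (`accept_prefix_ssd`, `accept_prefix_cgd`, `tokens_agree`) are hypothetical, not from the paper:

```python
# Sketch of the two token-filtering rules (assumptions, not the paper's code).
# `is_numeric` flags coordinate-like tokens, for which a distance threshold
# tau replaces exact matching.

def tokens_agree(a, b, numeric, tau=2):
    return abs(a - b) <= tau if numeric else a == b

def accept_prefix_ssd(draft, verify, is_numeric, tau=2):
    """SSD: keep the longest prefix of the drafted tokens that the
    re-prediction pass (first head; verify[i] plays t~_{k+i+1}) confirms.
    The first drafted token needs no verification, so it is always kept."""
    accepted = [draft[0]]
    for d, v, numeric in zip(draft[1:], verify, is_numeric[1:]):
        if not tokens_agree(d, v, numeric, tau):
            break  # one rejection discards everything after it
        accepted.append(d)
    return accepted

def accept_prefix_cgd(draft, confidences, eps=0.5):
    """CGD: keep tokens while their predicted confidence stays >= eps;
    no second forward pass is needed."""
    accepted = [draft[0]]
    for d, c in zip(draft[1:], confidences):
        if c < eps:
            break
        accepted.append(d)
    return accepted

# numeric tokens within tau=2 still count as agreeing (11 vs 12);
# the exact-match token 42 passes; |7 - 11| > tau rejects the tail
print(accept_prefix_ssd([10, 11, 42, 7], [12, 42, 11],
                        [True, True, False, True]))   # [10, 11, 42]
print(accept_prefix_cgd([10, 11, 42, 7], [0.9, 0.8, 0.3]))  # [10, 11, 42]
```

The distance-threshold branch is what distinguishes this setting from natural-language speculative decoding, where only exact matches are accepted.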
Loss & Training
- MTP loss: \(\mathcal{L}_{\text{MTP}} = -\sum_k \sum_i \lambda_h^{i-1} \log p(t_{k+i}|t_{\leq k})\), where \(\lambda_h\) is a decay factor assigning lower weight to more distant token predictions.
- Confidence loss: \(\mathcal{L}_c = -\sum_{i,k} \lambda_h^{i-1} (\hat{c}_{k+i} \log c_{k+i} + (1-\hat{c}_{k+i}) \log(1-c_{k+i}))\)
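A small worked example of the decay weighting in \(\mathcal{L}_{\text{MTP}}\), with an assumed \(\lambda_h = 0.8\) and made-up per-head probabilities (the paper treats \(\lambda_h\) as a hyperparameter):

```python
import math

# Worked example (illustrative values) of the position-decayed MTP loss:
# head i gets weight lambda_h**(i-1), so distant predictions count less.
lambda_h = 0.8                 # decay factor (assumed value)
probs = [0.9, 0.7, 0.5, 0.4]   # p(t_{k+i} | t_{<=k}) for heads i = 1..4 (made up)

weights = [lambda_h ** i for i in range(len(probs))]
loss = -sum(w * math.log(p) for w, p in zip(weights, probs))
print([round(w, 3) for w in weights])  # [1.0, 0.8, 0.64, 0.512]
print(round(loss, 4))
```

The geometric weights reflect that later heads predict with less context and are expected to be noisier.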
Key Experimental Results
Main Results (ASE Dataset, Layout Estimation)
| Method | n | Params | Latency | α (tokens/step) | F1-Score (test) |
|---|---|---|---|---|---|
| SceneScript | 1 | 14.00M | 382ms | 1 | 0.915 |
| SceneScript+MTP | 4 | 18.14M | 109ms | 4 | 0.889 |
| SceneScript+MTP | 8 | 23.67M | 62ms | 8 | 0.842 |
| SceneScript+MTP | 10 | 26.43M | 54ms | 10 | 0.814 |
| Fast SceneScript (SSD) | 8 | 15.05M | 81ms | 7.45 | 0.913 |
| Fast SceneScript (CGD) | 8 | 16.10M | 92ms | 6.30 | 0.913 |
| Fast SceneScript (SSD) | 10 | 15.05M | 75ms | 8.97 | 0.912 |
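The headline 5.09× layout-estimation speedup quoted in the TL;DR can be recovered from this table as the ratio of the NTP baseline latency to the best SSD latency:

```python
# Sanity check: reported layout-estimation speedup from the table's latencies.
baseline_ms = 382  # SceneScript (NTP, n=1)
fastest_ms = 75    # Fast SceneScript (SSD, n=10)
speedup = baseline_ms / fastest_ms
print(round(speedup, 2))  # 5.09
```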
Cross-Dataset Comparison (Structured3D, Layout Estimation)
| Method | Latency | F1-Score |
|---|---|---|
| RoomFormer | 54ms | 0.702 |
| SceneScript | 1176ms | 0.774 |
| Fast SceneScript (SSD, n=8) | 230ms | 0.791 |
| Fast SceneScript (CGD, n=8) | 269ms | 0.795 |
Key Findings
- SSD accepts more tokens per step than CGD (7.45 vs. 6.30) and, despite its extra verification pass, achieves lower latency (81ms vs. 92ms at n=8); CGD's advantage is that it completes filtering within a single step.
- The parameter-efficient mechanism substantially reduces parameters: at n=8, from 23.67M to 15.05M (−36%), adding only 7.5% over the original SceneScript.
- At n=10, vanilla MTP accuracy degrades severely (F1 drops to 0.814), while Fast SceneScript maintains 0.912.
- The distance threshold \(\tau\) for numerical tokens significantly improves acceptance rate: introducing distance-based matching yields ~1 additional accepted token per step under SSD.
- Experiments on SceneCAD validate both layout estimation and object detection, achieving ~5× speedup with improved accuracy on both tasks.
Highlights & Insights
- Determinism of Structured Language as a Natural Advantage for MTP: The strong structural constraints of structured language (e.g., coordinates must follow `make_wall`) make multi-token prediction far more feasible than in natural language. This insight generalizes to other structured output tasks (e.g., code generation, SQL query synthesis).
- "Predict Aggressively, Accept Selectively" Paradigm: Rather than demanding accuracy at every token position, the method makes bold predictions and filters out unreliable ones. SSD and CGD offer complementary trade-offs: SSD achieves lower latency but requires an extra verification pass; CGD filters in a single step but requires training a confidence head.
- Parameter Sharing Strategy: By exploiting the shared semantic nature of the language model's hidden space, a single lightweight Projection Block replaces \(n-1\) independent heads, reducing parameters by 43% at n=10 (26.43M → 15.05M) without sacrificing accuracy.
Limitations & Future Work
- SSD requires an additional forward pass for verification, making actual speedup slightly below the theoretical upper bound.
- CGD's confidence threshold \(\epsilon\) requires manual tuning and may need to be adjusted across different datasets.
- Validation is currently limited to 3D scene understanding tasks; applicability to 2D perception (e.g., object detection) remains unexplored.
- Token filtering retains only the longest reliable prefix, meaning a single unreliable intermediate token causes all subsequent tokens to be discarded, which may be overly conservative.
Related Work & Insights
- vs. SceneScript: The direct predecessor; Fast SceneScript preserves its architecture and interface, accelerating only the inference stage.
- vs. Medusa / DeepSeek-V3: Natural language MTP methods. This work is the first to apply MTP to structured perception language models, identifying the need for distance-based rather than exact-match token acceptance.
- vs. RoomFormer: The conventional detection-based approach has lower latency (54ms vs. 230ms) but lower F1 (0.702 vs. 0.791) and lacks the flexibility of language model-based methods.
Rating
- Novelty: ⭐⭐⭐⭐ First application of MTP to structured perception language models; CGD and parameter-sharing designs are original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, two tasks, and detailed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Methodology is clearly presented with intuitive table design.
- Value: ⭐⭐⭐⭐ 5× inference speedup holds significant engineering value for real-time 3D perception systems.