Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction

Conference: CVPR 2026
arXiv: 2512.05597
Code: None
Area: 3D Vision
Keywords: 3D scene understanding, multi-token prediction, structured language model, inference acceleration, self-speculative decoding

TL;DR

This paper proposes Fast SceneScript, which brings multi-token prediction (MTP) to structured language models for 3D scene understanding in order to accelerate inference. It filters unreliable tokens with self-speculative decoding (SSD) and confidence-guided decoding (CGD) and uses a parameter-efficient head-sharing mechanism, achieving 5.09× and 5.14× speedups on layout estimation and object detection, respectively, without accuracy loss.

Background & Motivation

Background: Structured language model-based 3D perception methods such as SceneScript represent 3D scenes as token sequences (e.g., [make_wall, x1, y1, z1, x2, y2, z2, height, thickness]), enabling a single model architecture to handle multiple tasks including layout estimation, 3D object detection, and coarse-grained reconstruction.
Limitations of Prior Work:
- Autoregressive next-token prediction (NTP) is slow; inference latency grows with sequence length (e.g., 1176ms on Structured3D);
- Directly applying MTP reduces inference steps but severely degrades accuracy (F1-Score drops from 0.913 to 0.840 with 8 heads);
- MTP introduces \((n-1)\) additional token heads, substantially increasing parameter count (14M→23.67M).
Key Challenge: The trade-off between MTP speedup and token prediction accuracy, compounded by the overhead of additional parameters.
Goal: Achieve multi-fold inference acceleration of structured language models while preserving accuracy and adding minimal parameter overhead.
Key Insight: Structured language (vs. natural language) exhibits stronger determinism and weak inter-token coupling, making MTP more feasible; the key challenge is designing reliable token filtering strategies to reject inaccurate predictions.
Core Idea: Predict multiple tokens via MTP, then filter unreliable tokens through SSD/CGD, retaining only the longest reliable prefix — a "predict aggressively, accept selectively" acceleration paradigm.

Method

Overall Architecture

Input 3D point clouds are encoded into features by a sparse 3D ResNet. Conditioned on the preceding tokens and the 3D features, a language decoder (a Transformer with self- and cross-attention) predicts \(n\) future tokens in a single step, along with \((n-1)\) confidence scores when CGD is used. A shared Projection Block and a shared Token Head process the hidden state for each token position. A token filtering stage then rejects unreliable tokens and accepts only the longest reliable prefix.
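
The "predict several tokens, keep the reliable prefix" loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `decode`, `toy_predict`, `toy_accept`, and the toy `TARGET` sequence are all hypothetical names, and the toy filter is an oracle that compares against a known target, standing in for SSD or CGD.

```python
from typing import Callable, List, Sequence, Tuple

def decode(predict_step: Callable[[Sequence[int]], List[int]],
           accept_len: Callable[[Sequence[int], List[int]], int],
           stop_token: int, max_len: int = 64) -> Tuple[List[int], int]:
    """Generic multi-token decoding loop; returns (tokens, number of steps)."""
    tokens: List[int] = []
    steps = 0
    while len(tokens) < max_len:
        candidates = predict_step(tokens)      # up to n candidate tokens
        if not candidates:
            break
        # Assumption: the first head's token is always kept, so progress is guaranteed.
        keep = max(1, accept_len(tokens, candidates))
        steps += 1
        for t in candidates[:keep]:
            tokens.append(t)
            if t == stop_token or len(tokens) >= max_len:
                return tokens, steps
    return tokens, steps

# Toy setup (hypothetical): a known target sequence and a 4-head predictor
# whose third head is always wrong.
TARGET = [5, 1, 2, 3, 4, 6, 7, 8, 9, 0]
N_HEADS = 4

def toy_predict(prefix: Sequence[int]) -> List[int]:
    k = len(prefix)
    cand = TARGET[k:k + N_HEADS]
    if len(cand) >= 3:
        cand = cand[:2] + [cand[2] + 100] + cand[3:]
    return cand

def toy_accept(prefix: Sequence[int], cand: List[int]) -> int:
    # Oracle filter for illustration only: length of the prefix that matches TARGET.
    k = len(prefix)
    n_ok = 0
    for i, t in enumerate(cand):
        if t != TARGET[k + i]:
            break
        n_ok += 1
    return n_ok

tokens, steps = decode(toy_predict, toy_accept, stop_token=0)
print(tokens == TARGET, steps)
```

In this toy run the sequence of 10 tokens is recovered in 5 steps instead of the 10 steps that next-token prediction would need, because each step keeps two tokens before hitting the corrupted third candidate.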

Key Designs

Parameter-Efficient Multi-Token Prediction:
- Function: Predicts \(n\) tokens per inference step, reducing decoding steps from \(N\) to \(\lceil N/n \rceil\).
- Mechanism: Conventional MTP adds an independent token head for each additional position, so the parameter count grows linearly with \(n\). Fast SceneScript instead shares a single Token Head across all \(n\) positions. A lightweight Projection Block (2 FFN blocks, each with 2 linear layers + ReLU + LayerNorm) maps the language decoder hidden state \(f_{k+1}\) to \(n-1\) distinct hidden states \(f_{k+i}\), \(i = 2, \dots, n\). Because the Projection Block is also shared across positions, the design adds only ~7.5% parameters (vs. 69% for MTP-8).
- Design Motivation: Language model hidden states are context-dependent and reside in a shared semantic space; although hidden states at different positions differ, they can be decoded by the same head. A Transformer FFN-like structure suffices to generate discriminative features.
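
The parameter figures quoted in this summary can be checked with a little arithmetic. The sketch below assumes that vanilla MTP grows by a fixed amount per extra head (inferred here from the 14M and 23.67M figures, not stated by the paper); the function name `vanilla_mtp_params` is hypothetical.

```python
BASE = 14.00     # SceneScript, n = 1 (millions of parameters)
MTP_8 = 23.67    # SceneScript + MTP, n = 8
FAST_SSD = 15.05 # Fast SceneScript (SSD), reported for both n = 8 and n = 10

# Assumed linear model: each extra independent head costs the same amount.
per_head = (MTP_8 - BASE) / (8 - 1)

def vanilla_mtp_params(n: int) -> float:
    """Estimated parameter count (millions) for vanilla MTP with n heads."""
    return BASE + (n - 1) * per_head

for n in (4, 8, 10):
    p = vanilla_mtp_params(n)
    print(f"vanilla MTP n={n}: {p:.2f}M, overhead {p / BASE - 1:.1%}")

print(f"Fast SceneScript (SSD): {FAST_SSD:.2f}M, overhead {FAST_SSD / BASE - 1:.1%}")
```

Under this linear assumption the estimate reproduces the reported 18.14M for n=4 and 26.43M for n=10, which is consistent with one independent head per extra position, while the shared design stays at 15.05M regardless of n.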
Self-Speculative Decoding (SSD):
- Function: Filters unreliable tokens via two-step verification to preserve accuracy.
- Mechanism: Step one uses \(n\) MTP heads to predict candidate tokens \(\{t_{k+1}, ..., t_{k+n}\}\); step two feeds these tokens as prefix input and uses the first (most reliable) head to re-predict \(\{\tilde{t}_{k+2}, ..., \tilde{t}_{k+n}\}\). Only the longest consistent prefix between the two steps is accepted. For numerical tokens, a distance threshold \(|t_{k+i} - \tilde{t}_{k+i}| \leq \tau\) is used instead of strict equality to increase acceptance rate.
- Design Motivation: Small errors in numerical tokens (coordinates, heights) in structured language are acceptable; distance-based matching is more appropriate than the exact matching used in natural language settings.
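
The SSD acceptance rule above might look like the following sketch. It assumes numeric tokens are integer bin indices, that non-numeric tokens must match exactly, and that the originally predicted token (rather than the re-predicted one) is kept on agreement; `ssd_accept` and its arguments are hypothetical names.

```python
from typing import List, Sequence

def ssd_accept(candidates: Sequence[int], verified: Sequence[int],
               is_numeric: Sequence[bool], tau: int) -> List[int]:
    """Return the longest prefix of `candidates` consistent with `verified`.

    candidates: t_{k+1}, ..., t_{k+n} from the n MTP heads.
    verified:   re-predicted t~_{k+2}, ..., t~_{k+n} from the first head (length n-1).
    is_numeric: is_numeric[i] says whether candidates[i] is a numeric token.
    tau:        distance threshold for numeric tokens.
    """
    accepted = [candidates[0]]  # the first head's own token needs no verification
    for i in range(1, len(candidates)):
        t, t_ref = candidates[i], verified[i - 1]
        if is_numeric[i]:
            ok = abs(t - t_ref) <= tau   # distance-based matching
        else:
            ok = (t == t_ref)            # exact matching for command tokens
        if not ok:
            break                        # everything after the first mismatch is dropped
        accepted.append(t)
    return accepted

# Hypothetical token ids: 900 is a command token, the rest are numeric bins.
cands = [900, 120, 45, 300, 77]
verif = [121, 45, 310, 77]
numeric = [False, True, True, True, True]
print(ssd_accept(cands, verif, numeric, tau=2))  # [900, 120, 45]
print(ssd_accept(cands, verif, numeric, tau=0))  # [900]
```

With `tau=2` the off-by-one coordinate (120 vs. 121) is tolerated and three tokens are accepted; with strict equality only the first token survives, which illustrates why the threshold raises the acceptance rate.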
Confidence-Guided Decoding (CGD):
- Function: Predicts tokens and confidence scores within the same inference step, enabling immediate filtering.
- Mechanism: Each additional head is paired with a Confidence Head that predicts the probability \(c_{k+i}\) that its token agrees with the first head's output. During training, a BCE loss supervises this score with the label \(\hat{c}_{k+i}=1\) if \(|t_{k+i} - \tilde{t}_{k+i}| \leq \tau\) and \(\hat{c}_{k+i}=0\) otherwise. At inference, tokens with \(c_{k+i} < \epsilon\) are marked as unreliable. Total loss: \(\mathcal{L} = \mathcal{L}_{\text{MTP}} + \lambda_c \mathcal{L}_c\).
- Design Motivation: SSD requires an additional verification step that increases latency. CGD completes prediction and verification within the same step, offering a more elegant single-step solution at the cost of training an additional Confidence Head.
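
A rough sketch of the two CGD pieces, label construction during training and threshold filtering at inference, is given below. The function names, the integer token ids, and the choice of exact matching for non-numeric tokens are assumptions made for illustration.

```python
from typing import List, Sequence

def confidence_targets(candidates: Sequence[int], reference: Sequence[int],
                       is_numeric: Sequence[bool], tau: int) -> List[float]:
    """Training labels c^_{k+i} for heads i = 2..n.

    reference holds the first head's tokens t~_{k+i} for the same positions.
    """
    labels = []
    for t, t_ref, num in zip(candidates[1:], reference, is_numeric[1:]):
        agree = abs(t - t_ref) <= tau if num else t == t_ref
        labels.append(1.0 if agree else 0.0)
    return labels

def cgd_accept(candidates: Sequence[int], confidences: Sequence[float],
               eps: float) -> List[int]:
    """Keep the longest prefix whose confidence scores all reach eps.

    confidences holds c_{k+i} for heads i = 2..n (length n-1).
    """
    accepted = [candidates[0]]  # the first head has no confidence score
    for t, c in zip(candidates[1:], confidences):
        if c < eps:
            break               # later tokens are discarded as well
        accepted.append(t)
    return accepted

cands = [900, 120, 45, 300, 77]
print(confidence_targets(cands, [121, 45, 310, 77],
                         [False, True, True, True, True], tau=2))
print(cgd_accept(cands, [0.99, 0.95, 0.40, 0.97], eps=0.9))
```

Note that the last token has high confidence in this example but is still dropped, since only a contiguous prefix is retained.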

Loss & Training

MTP loss: \(\mathcal{L}_{\text{MTP}} = -\sum_k \sum_i \lambda_h^{i-1} \log p(t_{k+i}|t_{\leq k})\), where \(\lambda_h\) is a decay factor assigning lower weight to more distant token predictions.
Confidence loss: \(\mathcal{L}_c = -\sum_{i,k} \lambda_h^{i-1} (\hat{c}_{k+i} \log c_{k+i} + (1-\hat{c}_{k+i}) \log(1-c_{k+i}))\)
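
For a single position \(k\), the two losses and their combination \(\mathcal{L} = \mathcal{L}_{\text{MTP}} + \lambda_c \mathcal{L}_c\) can be evaluated as in the sketch below. It is plain Python for readability; the function names and the example values of \(\lambda_h\) and \(\lambda_c\) are placeholders, not values from the paper, and the sum over \(k\) is omitted.

```python
import math
from typing import Sequence

def mtp_loss(probs: Sequence[float], lam_h: float) -> float:
    """probs[i-1] = p(t_{k+i} | t_{<=k}) that head i assigns to the true token."""
    # enumerate starts at 0, so lam_h ** j equals lam_h^(i-1) for head i = j + 1.
    return -sum(lam_h ** j * math.log(p) for j, p in enumerate(probs))

def confidence_loss(conf: Sequence[float], target: Sequence[float],
                    lam_h: float) -> float:
    """BCE over confidence heads i = 2..n; conf values must lie strictly in (0, 1)."""
    total = 0.0
    for j, (c, y) in enumerate(zip(conf, target)):
        i = j + 2  # confidence heads start at the second position
        total -= lam_h ** (i - 1) * (y * math.log(c) + (1 - y) * math.log(1 - c))
    return total

def total_loss(probs, conf, target, lam_h: float, lam_c: float) -> float:
    return mtp_loss(probs, lam_h) + lam_c * confidence_loss(conf, target, lam_h)

# Placeholder values: two heads, decay 0.5, confidence weight 2.0.
print(total_loss([0.5, 0.5], [0.5], [1.0], lam_h=0.5, lam_c=2.0))
```

The decay factor means an error at the second head costs half as much as the same error at the first head in this example, matching the stated intent of down-weighting more distant predictions.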

Key Experimental Results

Main Results (ASE Dataset, Layout Estimation)

| Method | n | Params | Latency | α (accepted tokens/step) | F1-Score (test) |
|---|---|---|---|---|---|
| SceneScript | 1 | 14.00M | 382ms | 1 | 0.915 |
| SceneScript+MTP | 4 | 18.14M | 109ms | 4 | 0.889 |
| SceneScript+MTP | 8 | 23.67M | 62ms | 8 | 0.842 |
| SceneScript+MTP | 10 | 26.43M | 54ms | 10 | 0.814 |
| Fast SceneScript (SSD) | 8 | 15.05M | 81ms | 7.45 | 0.913 |
| Fast SceneScript (CGD) | 8 | 16.10M | 92ms | 6.30 | 0.913 |
| Fast SceneScript (SSD) | 10 | 15.05M | 75ms | 8.97 | 0.912 |
Cross-Dataset Comparison (Structured3D, Layout Estimation)

| Method | Latency | F1-Score |
|---|---|---|
| RoomFormer | 54ms | 0.702 |
| SceneScript | 1176ms | 0.774 |
| Fast SceneScript (SSD, n=8) | 230ms | 0.791 |
| Fast SceneScript (CGD, n=8) | 269ms | 0.795 |

Key Findings

SSD accepts more tokens per step than CGD (7.45 vs. 6.30) and has lower latency on ASE (81ms vs. 92ms at n=8), even though it needs an additional verification step that CGD avoids.
The parameter-efficient mechanism substantially reduces parameters: at n=8, from 23.67M to 15.05M (−36%), adding only 7.5% over the original SceneScript.
At n=10, vanilla MTP accuracy degrades severely (F1 drops to 0.814), while Fast SceneScript maintains 0.912.
The distance threshold \(\tau\) for numerical tokens significantly improves acceptance rate: introducing distance-based matching yields ~1 additional accepted token per step under SSD.
Experiments on SceneCAD validate both layout estimation and object detection, achieving ~5× speedup with improved accuracy on both tasks.

Highlights & Insights

Determinism of Structured Language as a Natural Advantage for MTP: The strong structural constraints of structured language (e.g., coordinates must follow make_wall) make multi-token prediction far more feasible than in natural language. This insight may carry over to other structured output tasks (e.g., code generation, SQL query synthesis), although the paper does not evaluate them.
"Predict Aggressively, Accept Selectively" Paradigm: Rather than demanding accuracy at every token position, the method makes bold predictions and filters out unreliable ones. SSD and CGD offer complementary trade-offs: SSD achieves lower latency but requires an extra verification pass; CGD is more elegant but requires training a confidence head.
Parameter Sharing Strategy: By exploiting the shared semantic nature of the language model's hidden space, a single lightweight Projection Block replaces \(n-1\) independent heads, reducing parameters relative to vanilla MTP by 36% at n=8 (23.67M to 15.05M) and 43% at n=10 (26.43M to 15.05M) without sacrificing accuracy.

Limitations & Future Work

SSD requires an additional forward pass for verification, making actual speedup slightly below the theoretical upper bound.
CGD's confidence threshold \(\epsilon\) requires manual tuning and may need to be adjusted across different datasets.
Validation is currently limited to 3D scene understanding tasks; applicability to 2D perception (e.g., 2D object detection) remains unexplored.
Token filtering retains only the longest reliable prefix, meaning a single unreliable intermediate token causes all subsequent tokens to be discarded, which may be overly conservative.

Comparison with Related Work

vs. SceneScript: The direct predecessor; Fast SceneScript preserves its architecture and interface, accelerating only the inference stage.
vs. Medusa / DeepSeek-V3: Natural language MTP methods. This work is the first to apply MTP to structured perception language models, identifying the need for distance-based rather than exact-match token acceptance.
vs. RoomFormer: The conventional detection-based approach has lower latency (54ms vs. 230ms) but lower F1 (0.702 vs. 0.791) and lacks the flexibility of language model-based methods.

Rating

Novelty: ⭐⭐⭐⭐ First application of MTP to structured perception language models; CGD and parameter-sharing designs are original contributions.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, two tasks, and detailed ablation studies.
Writing Quality: ⭐⭐⭐⭐ Methodology is clearly presented with intuitive table design.
Value: ⭐⭐⭐⭐ 5× inference speedup holds significant engineering value for real-time 3D perception systems.