Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction

Conference: CVPR 2026 arXiv: 2512.05597 Code: None Area: 3D Vision Keywords: 3D scene understanding, multi-token prediction, structured language model, inference acceleration, self-speculative decoding

TL;DR

This paper proposes Fast SceneScript, which introduces multi-token prediction (MTP) into structured language models for 3D scene understanding to accelerate inference. Unreliable tokens are filtered by either self-speculative decoding (SSD) or confidence-guided decoding (CGD), and a parameter-efficient head-sharing mechanism keeps the parameter overhead small. The method achieves 5.09× and 5.14× speedups on layout estimation and object detection, respectively, without accuracy loss.

Background & Motivation

  1. Background: Structured language model-based 3D perception methods such as SceneScript represent 3D scenes as token sequences (e.g., [make_wall, x1, y1, z1, x2, y2, z2, height, thickness]), enabling a single model architecture to handle multiple tasks including layout estimation, 3D object detection, and coarse-grained reconstruction.
  2. Limitations of Prior Work:
    • Autoregressive next-token prediction (NTP) is slow; inference latency grows with sequence length (e.g., 1176ms on Structured3D);
    • Directly applying MTP reduces inference steps but severely degrades accuracy (F1-Score drops from 0.913 to 0.840 with 8 heads);
    • MTP introduces \((n-1)\) additional token heads, substantially increasing parameter count (14M→23.67M).
  3. Key Challenge: The trade-off between MTP speedup and token prediction accuracy, compounded by the overhead of additional parameters.
  4. Goal: How to achieve multi-fold inference acceleration of structured language models while preserving accuracy with minimal parameter overhead?
  5. Key Insight: Structured language (vs. natural language) exhibits stronger determinism and weak inter-token coupling, making MTP more feasible; the key challenge is designing reliable token filtering strategies to reject inaccurate predictions.
  6. Core Idea: Predict multiple tokens via MTP, then filter unreliable tokens through SSD/CGD, retaining only the longest reliable prefix — a "predict aggressively, accept selectively" acceleration paradigm.

Method

Overall Architecture

Input 3D point clouds are encoded into features via a sparse 3D ResNet. A language decoder (Transformer with self/cross-attention) predicts \(n\) future tokens and \((n-1)\) confidence scores in a single step, conditioned on preceding tokens and 3D features. A shared Projection Block and Token Head process the hidden states for each token position. A token filtering stage rejects unreliable tokens, accepting only the longest reliable prefix.
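The head-sharing output stage described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the dimensions, the weight initialisation, and the iterative application of the shared Projection Block to derive the \(n-1\) extra hidden states are all assumptions.

```python
import numpy as np

# Sketch of the parameter-efficient MTP output stage: ONE Projection Block
# (2 FFN blocks, each linear -> ReLU -> linear -> LayerNorm) and ONE Token
# Head serve all n prediction positions. All shapes/weights are illustrative.

rng = np.random.default_rng(0)
d_model, vocab = 64, 128

def ffn_block(x, W1, b1, W2, b2):
    h = np.maximum(x @ W1 + b1, 0.0)            # linear + ReLU
    h = h @ W2 + b2                             # second linear
    return (h - h.mean()) / (h.std() + 1e-5)    # LayerNorm (no learned scale)

# Shared weights: one Projection Block (2 FFN blocks) + one Token Head.
Wp = [rng.normal(0, 0.1, (d_model, d_model)) for _ in range(4)]
bp = [np.zeros(d_model) for _ in range(4)]
W_head = rng.normal(0, 0.1, (d_model, vocab))

def predict_n_tokens(f, n):
    """Decode n token distributions from one decoder hidden state f_{k+1}.

    Assumption: the shared Projection Block is applied iteratively to
    produce each further hidden state f_{k+i}."""
    states = [f]
    for _ in range(n - 1):
        f = ffn_block(f, Wp[0], bp[0], Wp[1], bp[1])
        f = ffn_block(f, Wp[2], bp[2], Wp[3], bp[3])
        states.append(f)
    return [s @ W_head for s in states]         # same Token Head for every head

logits = predict_n_tokens(rng.normal(size=d_model), n=8)
# 8 vocabulary distributions obtained in a single decoding step
```

Because only the Projection Block and a single Token Head are added, the extra parameter cost stays flat in \(n\), matching the ~7.5% overhead reported in the paper.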

Key Designs

  1. Parameter-Efficient Multi-Token Prediction:

    • Function: Predicts \(n\) tokens per inference step, reducing decoding steps from \(N\) to \(\lceil N/n \rceil\).
    • Mechanism: Conventional MTP introduces independent token heads for each additional position, causing linear parameter growth. Fast SceneScript shares a single Token Head across all \(n\) heads. A lightweight Projection Block (2 FFN blocks, each with 2 linear layers + ReLU + LayerNorm) maps the language decoder hidden state \(f_{k+1}\) to \(n-1\) distinct hidden states \(f_{k+i}\). The Projection Block is shared across all heads, adding only ~7.5% parameters (vs. 69% for MTP-8).
    • Design Motivation: Language model hidden states are context-dependent and reside in a shared semantic space; although hidden states at different positions differ, they can be decoded by the same head. A Transformer FFN-like structure suffices to generate discriminative features.
  2. Self-Speculative Decoding (SSD):

    • Function: Filters unreliable tokens via two-step verification to preserve accuracy.
    • Mechanism: Step one uses \(n\) MTP heads to predict candidate tokens \(\{t_{k+1}, ..., t_{k+n}\}\); step two feeds these tokens as prefix input and uses the first (most reliable) head to re-predict \(\{\tilde{t}_{k+2}, ..., \tilde{t}_{k+n}\}\). Only the longest consistent prefix between the two steps is accepted. For numerical tokens, a distance threshold \(|t_{k+i} - \tilde{t}_{k+i}| \leq \tau\) is used instead of strict equality to increase acceptance rate.
    • Design Motivation: Small errors in numerical tokens (coordinates, heights) in structured language are acceptable; distance-based matching is more appropriate than the exact matching used in natural language settings.
  3. Confidence-Guided Decoding (CGD):

    • Function: Predicts tokens and confidence scores within the same inference step, enabling immediate filtering.
    • Mechanism: Each additional head is paired with a Confidence Head that predicts the probability \(c_{k+i}\) that the token agrees with the first head's output. During training, a BCE loss supervises this: \(\hat{c}_{k+i}=1\) if \(|t_{k+i} - \tilde{t}_{k+i}| \leq \tau\). At inference, tokens with \(c_{k+i} < \epsilon\) are marked as unreliable. Total loss: \(\mathcal{L} = \mathcal{L}_{\text{MTP}} + \lambda_c \mathcal{L}_c\).
    • Design Motivation: SSD requires an additional verification step that increases latency. CGD completes prediction and verification within the same step, offering a more elegant single-step solution at the cost of training an additional Confidence Head.
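Both filtering strategies reduce to simple longest-prefix acceptance rules. The sketch below is an illustrative reconstruction (function names, the `tau` and `eps` defaults, and the toy inputs are assumptions); per the paper, the first head's token needs no verification, so the lists here cover positions \(k+2, \ldots, k+n\).

```python
# "Predict aggressively, accept selectively": keep only the longest
# reliable prefix of the drafted tokens. NUMERIC positions (coordinates,
# heights) use the relaxed distance match |t - t~| <= tau.

def accept_prefix_ssd(draft, verify, numeric, tau=2):
    """SSD: keep the longest prefix of `draft` consistent with the
    second-pass re-predictions `verify` (distance match for numeric tokens)."""
    accepted = []
    for t, v, is_num in zip(draft, verify, numeric):
        ok = abs(t - v) <= tau if is_num else t == v
        if not ok:
            break                      # one mismatch discards the rest
        accepted.append(t)
    return accepted

def accept_prefix_cgd(draft, conf, eps=0.5):
    """CGD: keep the longest prefix whose confidence c_{k+i} >= eps,
    with no second forward pass."""
    accepted = []
    for t, c in zip(draft, conf):
        if c < eps:
            break
        accepted.append(t)
    return accepted

# Toy usage: numeric drafts 5, 7, 9 vs. re-predictions 5, 6, 20 with tau=2
# accept [5, 7]; the third token is off by 11 and truncates the prefix.
print(accept_prefix_ssd([5, 7, 9], [5, 6, 20], [True, True, True], tau=2))
```

The break-on-first-failure structure is exactly the conservatism noted in the limitations: a single unreliable intermediate token discards everything after it.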

Loss & Training

  • MTP loss: \(\mathcal{L}_{\text{MTP}} = -\sum_k \sum_i \lambda_h^{i-1} \log p(t_{k+i}|t_{\leq k})\), where \(\lambda_h\) is a decay factor assigning lower weight to more distant token predictions.
  • Confidence loss: \(\mathcal{L}_c = -\sum_{i,k} \lambda_h^{i-1} (\hat{c}_{k+i} \log c_{k+i} + (1-\hat{c}_{k+i}) \log(1-c_{k+i}))\)
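The two losses above can be checked with a toy numeric sketch. The probabilities, targets, and the \(\lambda_c = 0.1\) weight below are invented for illustration; `lam_h**i` with `i` starting at 0 corresponds to \(\lambda_h^{i-1}\) with heads indexed from 1.

```python
import math

def mtp_loss(head_probs, lam_h=0.8):
    """Weighted NLL: head_probs[i] is the model probability assigned to the
    ground-truth token at offset i+1; distant heads are down-weighted."""
    return -sum(lam_h**i * math.log(p) for i, p in enumerate(head_probs))

def conf_loss(conf, targets, lam_h=0.8):
    """Per-head BCE on confidence scores; target c^ = 1 when the head's
    token matches the first head's output within tau."""
    return -sum(
        lam_h**i * (t * math.log(c) + (1 - t) * math.log(1 - c))
        for i, (c, t) in enumerate(zip(conf, targets))
    )

# Total training loss L = L_MTP + lambda_c * L_c (lambda_c here is a guess).
total = mtp_loss([0.9, 0.7, 0.5]) + 0.1 * conf_loss([0.95, 0.6], [1, 0])
```

Note that a perfectly confident, perfectly correct model drives both terms to zero, which is easy to verify with `mtp_loss([1.0, 1.0]) == 0.0`.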

Key Experimental Results

Main Results (ASE Dataset, Layout Estimation)

| Method | n | Params | Latency | α (tokens/step) | F1-Score (test) |
| --- | --- | --- | --- | --- | --- |
| SceneScript | 1 | 14.00M | 382ms | 1 | 0.915 |
| SceneScript+MTP | 4 | 18.14M | 109ms | 4 | 0.889 |
| SceneScript+MTP | 8 | 23.67M | 62ms | 8 | 0.842 |
| SceneScript+MTP | 10 | 26.43M | 54ms | 10 | 0.814 |
| Fast SceneScript (SSD) | 8 | 15.05M | 81ms | 7.45 | 0.913 |
| Fast SceneScript (CGD) | 8 | 16.10M | 92ms | 6.30 | 0.913 |
| Fast SceneScript (SSD) | 10 | 15.05M | 75ms | 8.97 | 0.912 |

Cross-Dataset Comparison (Structured3D, Layout Estimation)

| Method | Latency | F1-Score |
| --- | --- | --- |
| RoomFormer | 54ms | 0.702 |
| SceneScript | 1176ms | 0.774 |
| Fast SceneScript (SSD, n=8) | 230ms | 0.791 |
| Fast SceneScript (CGD, n=8) | 269ms | 0.795 |

Key Findings

  • SSD accepts more tokens per step than CGD (7.45 vs. 6.30) with lower latency, but CGD requires no additional verification step.
  • The parameter-efficient mechanism substantially reduces parameters: at n=8, from 23.67M to 15.05M (−36%), adding only 7.5% over the original SceneScript.
  • At n=10, vanilla MTP accuracy degrades severely (F1 drops to 0.814), while Fast SceneScript maintains 0.912.
  • The distance threshold \(\tau\) for numerical tokens significantly improves acceptance rate: introducing distance-based matching yields ~1 additional accepted token per step under SSD.
  • Experiments on SceneCAD validate both layout estimation and object detection, achieving ~5× speedup with improved accuracy on both tasks.

Highlights & Insights

  • Determinism of Structured Language as a Natural Advantage for MTP: The strong structural constraints of structured language — e.g., coordinates must follow make_wall — make multi-token prediction far more feasible than in natural language. This insight generalizes to all structured output tasks (e.g., code generation, SQL query synthesis).
  • "Predict Aggressively, Accept Selectively" Paradigm: Rather than demanding accuracy at every token position, the method makes bold predictions and filters out unreliable ones. SSD and CGD offer complementary trade-offs: SSD achieves lower latency but requires an extra verification pass; CGD is more elegant but requires training a confidence head.
  • Parameter Sharing Strategy: By exploiting the shared semantic nature of the language model's hidden space, a single lightweight Projection Block replaces the \(n-1\) independent heads, cutting total parameters by 36% at n=8 (43% at n=10) relative to vanilla MTP without sacrificing accuracy.

Limitations & Future Work

  • SSD requires an additional forward pass for verification, making actual speedup slightly below the theoretical upper bound.
  • CGD's confidence threshold \(\epsilon\) requires manual tuning and may need to be adjusted across different datasets.
  • Validation is currently limited to 3D scene understanding tasks; applicability to 2D perception (e.g., object detection) remains unexplored.
  • Token filtering retains only the longest reliable prefix, meaning a single unreliable intermediate token causes all subsequent tokens to be discarded, which may be overly conservative.

Comparison with Related Work

  • vs. SceneScript: The direct predecessor; Fast SceneScript preserves its architecture and interface, accelerating only the inference stage.
  • vs. Medusa / DeepSeek-V3: Natural language MTP methods. This work is the first to apply MTP to structured perception language models, identifying the need for distance-based rather than exact-match token acceptance.
  • vs. RoomFormer: The conventional detection-based approach has lower latency (54ms vs. 230ms) but lower F1 (0.702 vs. 0.791) and lacks the flexibility of language model-based methods.

Rating

  • Novelty: ⭐⭐⭐⭐ First application of MTP to structured perception language models; CGD and parameter-sharing designs are original contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, two tasks, and detailed ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Methodology is clearly presented with intuitive table design.
  • Value: ⭐⭐⭐⭐ 5× inference speedup holds significant engineering value for real-time 3D perception systems.