Training-free Online Video Step Grounding¶
Conference: NeurIPS 2025 | arXiv: 2510.16989 | Code: GitHub | Area: Multimodal VLM | Keywords: Video step grounding, Bayesian filtering, LMM zero-shot, online inference, training-free
TL;DR¶
This paper proposes BaGLM, a training-free online video step grounding method that integrates LLM-estimated step dependencies and LMM-estimated step progress into zero-shot LMM predictions via Bayesian filtering, outperforming existing trained offline methods on three datasets.
Background & Motivation¶
Video Step Grounding (VSG) aims to identify which steps are executed in a video, given a set of procedural step descriptions. This has significant applications in real-time AR/XR guidance (e.g., cooking assistance, furniture assembly).
Existing VSG methods face two core limitations:
Training data dependency: They require collected and annotated training data (e.g., narration text from HowTo100M), incurring high annotation costs and potentially biasing models toward specific video and task distributions.
Offline processing requirement: They assume access to the complete video, making them unsuitable for real-time video stream scenarios.
The central question explored in this paper is: can VSG be performed online without any training? The authors first report a surprising finding: directly applying a zero-shot LMM to predict steps segment by segment already surpasses trained offline SOTA methods (InternVL2.5-8B exceeds MPTVA by 6.4 points on CrossTask and NaSVA by 16.1 points on Ego4D GoalStep). This motivates a natural next step: injecting information from past frames into the LMM's predictions while preserving the zero-shot advantage.
Method¶
Overall Architecture¶
BaGLM formulates VSG as a Bayesian filtering problem: the state is the step \(a\) corresponding to the current segment, and the observation is the current video segment \(\mathbf{S}_t\). The posterior belief \(\text{bel}_t(a)\) is recursively estimated through a predict step (leveraging inter-step transition relations) and an update step (leveraging LMM observation predictions).
Key Designs¶
- LMM as a Zero-Shot Observation Model
VSG is reformulated as a multiple-choice question: given the current video segment \(\mathbf{S}_t\) and all step options (including "none"), the LMM is prompted to output probabilities for each option. After normalization, a step probability distribution \(f_{\text{LMM}}^{\text{VSG}}: \mathcal{V} \times \mathcal{A} \to \Delta^{K+1}\) is obtained.
InternVL2.5-8B is used, with only the current 2-second segment and the step list as input — no full-video access is required.
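As a rough illustration, the sketch below wires up such a multiple-choice observation model. The `score_option` stub stands in for the actual InternVL2.5-8B call; its prompt wording and scoring interface are assumptions, not the paper's exact prompt.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    z = np.asarray(x, dtype=float) - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def score_option(segment, prompt, letter):
    # Hypothetical stand-in for the LMM: should return the log-likelihood
    # the model assigns to answer `letter` given the segment and prompt.
    # Stubbed with deterministic pseudo-random scores so the sketch runs.
    rng = np.random.default_rng(abs(hash(letter)) % (2**32))
    return rng.normal()

def observe(segment, steps):
    # f_LMM^VSG: distribution over the K step options plus a "none" option.
    options = list(steps) + ["none of the above"]
    prompt = "Which step is being executed in this clip?\n" + "\n".join(
        f"({chr(65 + i)}) {s}" for i, s in enumerate(options)
    )
    logits = [score_option(segment, prompt, chr(65 + i))
              for i in range(len(options))]
    return softmax(logits)  # normalized step probabilities, shape (K + 1,)
```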
- PREDICT Step: LLM-Driven Step Transition Model
An LLM (LLaMA3-70B-Instruct) is used to estimate an inter-step dependency matrix \(\mathbf{D} \in \mathbb{R}^{K \times K}\), where \(\mathbf{D}_{i,j}\) denotes the probability that step \(a_j\) is a prerequisite of step \(a_i\). The transition matrix is initialized as \(\mathbf{T} = \mathbf{D}^\top\).
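A minimal sketch of the dependency-matrix extraction, assuming a hypothetical `ask_llm` helper that returns a scalar in [0, 1]; the paper's actual prompt to LLaMA3-70B-Instruct is not reproduced here.

```python
import numpy as np

def ask_llm(prompt: str) -> float:
    # Hypothetical LLM call returning a probability in [0, 1];
    # in practice this would query LLaMA3-70B-Instruct. Stubbed here.
    return 0.5

def estimate_dependencies(steps):
    """Build D with D[i, j] ~ P(step j is a prerequisite of step i)."""
    K = len(steps)
    D = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if i != j:
                D[i, j] = ask_llm(
                    f"How likely (0 to 1) is it that '{steps[j]}' must be "
                    f"completed before '{steps[i]}' can start?"
                )
    return D  # the transition matrix is then initialized as T = D.T
```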
A key innovation is dynamically adjusting the transition matrix based on step progress. Two metrics are introduced:
- Readiness: the degree to which prerequisites of step \(a_i\) have been completed: \(\mathbf{r}_t[i] = \frac{\sum_j \mathbf{D}_{i,j} \cdot \max_{\tau < t} \text{progress}_\tau[j]}{\sum_j \mathbf{D}_{i,j}}\)
- Validity: whether the successors of step \(a_i\) have not yet been executed (preventing repeated attribution): \(\mathbf{v}_t[i] = \frac{\sum_j \mathbf{D}_{j,i} \cdot (1 - \max_{\tau < t} \text{progress}_\tau[j])}{\sum_j \mathbf{D}_{j,i}}\)
The adjusted transition matrix is: \(\tilde{\mathbf{T}}_t[i,j] = \frac{\mathbf{T}[i,j] \cdot \mathbf{r}_t[j] \cdot \mathbf{v}_t[j]}{\sum_k \mathbf{T}[i,k] \cdot \mathbf{r}_t[k] \cdot \mathbf{v}_t[k]}\)
Step progress is estimated by querying the LMM for the completion degree of each step on a 0–9 scale.
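In NumPy, the readiness/validity reweighting defined above reduces to a few matrix operations. This is a sketch under one stated assumption: steps with no prerequisites (or no successors) are treated as fully ready (or fully valid), a case the formulas leave undefined.

```python
import numpy as np

def adjusted_transition(T, D, progress_history, eps=1e-8):
    """Progress-adjusted transition matrix T~_t.

    T: (K, K) transition matrix, initialized as D.T
    D: (K, K) dependency matrix, D[i, j] = P(step j precedes step i)
    progress_history: (t, K) LMM progress estimates (the 0-9 scale
        rescaled to [0, 1]) for all past segments tau < t
    """
    p = progress_history.max(axis=0)  # best progress seen so far per step

    # Readiness r_t[i]: completed fraction of step i's prerequisites.
    row = D.sum(axis=1)
    r = np.where(row > 0, (D @ p) / np.maximum(row, eps), 1.0)

    # Validity v_t[i]: step i's successors have not yet been executed.
    col = D.sum(axis=0)
    v = np.where(col > 0, (D.T @ (1.0 - p)) / np.maximum(col, eps), 1.0)

    # Reweight target columns by r * v, then renormalize each row.
    Tt = T * (r * v)[None, :]
    return Tt / (Tt.sum(axis=1, keepdims=True) + eps)
```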
- UPDATE Step: Fusing LMM Predictions with the Bayesian Prior
The final belief is obtained as the product of the LMM observation likelihood and the predict-step prior:
\(\text{bel}_t(a_i) = \frac{1}{\mathcal{Z}} \cdot f_{\text{LMM}}(\mathbf{S}_t, \pi_{\text{VSG}})[i] \cdot \sum_{a_j \in \mathcal{A}} \tilde{\mathbf{T}}_t[j,i] \cdot \text{bel}_{t-1}(a_j)\)
where \(\mathcal{Z}\) is a normalization factor. This formulation elegantly integrates instantaneous LMM predictions with historically accumulated beliefs and inter-step dependencies.
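Putting predict and update together, one filter iteration is only a few lines. The sketch below assumes a plain K-dimensional belief and folds the "none" option into the step set for simplicity.

```python
import numpy as np

def bayes_filter_step(bel_prev, obs_probs, T_t, eps=1e-8):
    """One BaGLM predict + update cycle.

    bel_prev:  belief over steps at t-1, shape (K,)
    obs_probs: LMM observation distribution f_LMM(S_t, pi_VSG), shape (K,)
    T_t:       progress-adjusted transition matrix for time t, shape (K, K)
    """
    prior = T_t.T @ bel_prev        # predict: sum_j T~_t[j, i] * bel_{t-1}(a_j)
    bel = obs_probs * prior         # update: weight the prior by the LMM likelihood
    return bel / (bel.sum() + eps)  # normalize by Z
```

An online loop would then alternate `observe`, `adjusted_transition`, and `bayes_filter_step` for every incoming 2-second segment, starting from a uniform belief.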
Loss & Training¶
No training is required. All components rely on the zero-shot capabilities of pretrained models: InternVL2.5-8B for observation and progress estimation, and LLaMA3-70B for dependency matrix extraction. Videos are segmented into non-overlapping 2-second clips.
Key Experimental Results¶
Main Results¶
| Method | Setting | HT-Step R@1 | CrossTask Avg. R@1 | Ego4D R@1 |
|---|---|---|---|---|
| VINA | Offline+Trained | 39.1 | 44.8 | - |
| NaSVA | Offline+Trained | 53.1 | 46.7 | 29.1 |
| MPTVA | Offline+Trained | - | 47.9 | - |
| VSLNet | Offline+Trained | - | - | 24.3 |
| NaSVA (online) | Online+Trained | 46.1 | - | 24.2 |
| BaGLM | Online+Training-free | 57.4 | 59.8 | 43.3 |
BaGLM surpasses NaSVA by +4.3/+13.1/+14.2 points on HT-Step/CrossTask/Ego4D, respectively.
Ablation Study: Transition Model Components¶
| Configuration | HT-Step R@1 | CrossTask Avg. R@1 | Ego4D R@1 |
|---|---|---|---|
| Static transition matrix only | 55.9 | 58.0 | 42.1 |
| + Readiness | 57.0 | 58.8 | 42.0 |
| + Validity | 56.4 | 58.8 | 43.1 |
| + Readiness + Validity | 57.4 | 59.8 | 43.3 |
Oracle Experiments¶
| Configuration | HT-Step R@1 | CrossTask Avg. R@1 | Ego4D R@1 |
|---|---|---|---|
| Estimated dependencies + Estimated progress | 57.4 | 59.8 | 43.3 |
| Oracle dependencies + Oracle progress | 62.6 | 66.9 | 82.2 |
The oracle setting yields a 38.9-point improvement on Ego4D, demonstrating that the Bayesian filtering framework itself is highly effective and that better progress and dependency estimation would yield substantial further gains.
Key Findings¶
- Zero-shot LMMs observing only the current segment already constitute a strong VSG baseline, surpassing specialized methods trained on HowTo100M.
- A 2-second segment duration is the optimal trade-off: shorter segments lack sufficient visual cues, while longer ones span multiple steps.
- BaGLM consistently improves over all LMMs on HT-Step and CrossTask, but gains are limited on Ego4D GoalStep (longer videos, coarser step descriptions).
- The choice of LLM (LLaMA-3.3-70B vs. GPT-4.1-mini) has minimal impact, indicating robustness to LLM selection.
Highlights & Insights¶
- The combination of Bayesian filtering and LMMs is highly elegant, seamlessly integrating classical probabilistic inference with modern large model capabilities.
- Fully training-free online inference that surpasses trained offline methods: a significant advantage for real-world deployment.
- The dynamic transition model based on step progress estimation and the dependency matrix is the key innovation, injecting task-specific knowledge into the Bayesian framework.
- Oracle experiments clearly identify directions for future improvement.
Limitations & Future Work¶
- The method relies on the quality of LLM-estimated step dependencies, which may be inaccurate for ambiguous or highly domain-specific steps.
- Step progress estimation depends on the LMM's subjective judgment, limiting precision.
- Improvements are marginal on long videos such as Ego4D (average 28 minutes), indicating that long-range dependency modeling needs to be strengthened.
- Each segment requires two LMM calls (step prediction + progress estimation), and real-time performance is constrained by LMM inference speed.
Related Work & Insights¶
- The VSG field has evolved from weak supervision (Zhukov et al.) to LLM-assisted pseudo-labeling (NaSVA, MPTVA).
- Works such as VQAScore demonstrate that LMMs can replace CLIP for video-language alignment evaluation.
- The classical application of Bayesian filtering in tracking and localization is creatively adapted here for step grounding.
- Inspiration: this framework can be generalized to other sequential video understanding tasks (e.g., action anticipation, procedural anomaly detection).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The training-free online paradigm combining Bayesian filtering with LMMs is entirely original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, four LMMs, ablation studies, oracle analysis, and segment duration analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematical modeling is clear; the progression from preliminary findings to method design is logically coherent.
- Value: ⭐⭐⭐⭐⭐ Establishes a new paradigm for training-free online inference in video understanding.