# Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Conference: NeurIPS 2025 · arXiv: 2511.21998 · Code: GitHub · Area: Multimodal VLM · Keywords: streaming video understanding, interactive guidance, error detection, Mamba, step-by-step guidance
## TL;DR
This paper introduces the Qualcomm Interactive Cooking benchmark and the LiveMamba model, presenting the first systematic evaluation of multimodal LLMs for providing real-time, step-by-step task guidance in streaming video — encompassing instruction delivery, completion detection, and error feedback.
## Background & Motivation
Despite their strong conversational capabilities, most multimodal LLMs are limited to turn-based interaction, generating responses only when explicitly queried. A genuinely useful AI assistant, however, must be able to react asynchronously within a video stream:
- Proactively delivering the next instruction: automatically issuing the next step once the user completes the current one
- Detecting instruction completion: determining whether the user has successfully executed the current instruction
- Identifying and reporting errors: alerting the user promptly upon mistakes
Existing datasets (Epic-Kitchens, Ego4D, HowTo100M, etc.) primarily capture expert demonstrations or routine activities and lack user error scenarios, making them insufficient for evaluating interactive guidance capabilities. The authors leverage the CaptainCook4D dataset — which contains user error recordings — to construct the first benchmark with precise timestamp-aligned instruction and feedback annotations.
## Method
### Overall Architecture
LiveMamba is a lightweight streaming multimodal LLM composed of a visual encoder and a language model backbone; a minimal sketch of how the pieces fit together follows the list:
- Visual Encoder: InternViT-300M-448px extracts \(M\) visual tokens per frame
- Q-Former Adapter: compresses \(M\) tokens to \(K\) tokens via 4-layer cross-attention
- Language Backbone: Mamba-130M recurrent model, enabling efficient long-sequence inference
- Replanning Module: an external Qwen3-32B model that re-orders instructions when the user deviates from the plan
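A minimal PyTorch-style sketch of how these components might compose. The module internals are injected, and all class, method, and shape names here are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class LiveMambaSketch(nn.Module):
    # Illustrative composition of the components above; names and shapes are
    # assumptions, not the official implementation.
    def __init__(self, vision_encoder, qformer, mamba_lm,
                 num_queries=16, embed_dim=768):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. InternViT-300M-448px: frame -> M tokens
        self.qformer = qformer                # 4-layer cross-attention adapter: M -> K tokens
        self.mamba_lm = mamba_lm              # Mamba-130M recurrent language backbone
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))  # K learned queries

    def step(self, frame, state):
        # Ingest one streamed frame and advance the recurrent state.
        m_tokens = self.vision_encoder(frame)            # (M, D) patch tokens
        k_tokens = self.qformer(self.queries, m_tokens)  # (K, D) compressed tokens
        return self.mamba_lm(k_tokens, state)            # (logits, new_state)
```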
### Key Designs
"When-to-Say" Mechanism: Two special tokens are introduced:
- `<vision>`: requests the next video frame as input
- `<response>`: triggers instruction or feedback generation at an appropriate moment
After each frame, the model autonomously decides whether to continue observing or emit a response, without any external prompting.
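A sketch of what this decision loop might look like, reusing the `step` interface from the architecture sketch above. Only the `<vision>` and `<response>` tokens come from the paper; the control flow, the `decode_response` helper, and the `lm_step` interface are assumptions:

```python
def decode_response(model, state, tokenizer, max_new_tokens=64):
    # Hypothetical helper: greedy token-by-token decoding after <response>,
    # carrying the recurrent state across generated tokens.
    ids, token = [], tokenizer.bos_token_id
    for _ in range(max_new_tokens):
        logits, state = model.lm_step(token, state)  # hypothetical single-token step
        token = logits.argmax().item()
        if token == tokenizer.eos_token_id:
            break
        ids.append(token)
    return tokenizer.decode(ids), state

def streaming_loop(model, frames, tokenizer, state=None):
    # Sketch of the "when-to-say" loop: after each frame the model itself
    # chooses between observing further and speaking.
    VISION = tokenizer.convert_tokens_to_ids("<vision>")
    RESPONSE = tokenizer.convert_tokens_to_ids("<response>")
    for frame in frames:
        logits, state = model.step(frame, state)  # ingest one frame
        next_token = logits[-1].argmax().item()   # model's own decision
        if next_token == VISION:
            continue                              # keep silently observing
        if next_token == RESPONSE:
            text, state = decode_response(model, state, tokenizer)
            yield text                            # instruction or feedback
```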
Iterative Replanning: When the user skips or reorders steps, LiveMamba invokes the external replanner, which receives the initial plan, completed steps, and feedback, then selects the optimal next instruction.
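The replanner call might look roughly like the following; the prompt wording and the `replanner.generate` interface are assumptions (the paper only specifies Qwen3-32B as the replanner):

```python
def replan(replanner, initial_plan, completed_steps, feedback):
    # Sketch of the external replanning call; prompt wording is assumed.
    prompt = (
        "You are guiding a user through a multi-step task.\n"
        f"Full plan: {initial_plan}\n"
        f"Steps already completed: {completed_steps}\n"
        f"Latest feedback to the user: {feedback}\n"
        "Return the single best next instruction from the plan."
    )
    return replanner.generate(prompt)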
### Data Augmentation Strategies
Instruction Completion Augmentation (ICAug): Action descriptions from Epic-Kitchens and Ego4D are converted into instruction–feedback pairs, issuing an instruction at action onset and confirming completion at action offset.
Counterfactual Error Augmentation (CFAug): "Plausible counterfactual errors" are generated by modifying action descriptions into realistic mistake scenarios (e.g., changing "add 1 teaspoon of salt" to "add 1 tablespoon of salt"), producing training data for error conditions.
Temporal Jittering: A random perturbation of \(\pm 30\) seconds is applied to instruction timestamps, preventing error accumulation in the autoregressive model.
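The three augmentations might be sketched as follows; the function names, prompt wording, and phrasing templates are illustrative assumptions:

```python
import random

def icaug(action, start_s, end_s):
    # ICAug sketch: turn one narrated action span into an instruction at the
    # action's onset and a completion confirmation at its offset.
    return [
        (start_s, f"Next step: {action}."),
        (end_s, f"You have completed: {action}."),
    ]

def cfaug(action, rewrite_llm):
    # CFAug sketch: ask an LLM for a plausible counterfactual error, e.g.
    # "add 1 teaspoon of salt" -> "add 1 tablespoon of salt".
    # The prompt and rewrite_llm.generate interface are assumptions.
    return rewrite_llm.generate(
        f"Rewrite this cooking step as a realistic user mistake: {action}"
    )

def jitter(timestamp_s, max_shift_s=30.0):
    # Temporal jittering: shift instruction timestamps by up to +/-30 s so the
    # autoregressive model does not overfit to exact annotated timings.
    return timestamp_s + random.uniform(-max_shift_s, max_shift_s)
```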
### Loss & Training
The model is trained with a standard autoregressive language modeling loss. During pretraining, only the Q-Former is optimized to align visual and textual embeddings; during fine-tuning, both the Q-Former and the Mamba language backbone are jointly trained.
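A sketch of this two-stage schedule, assuming the module names from the architecture sketch above; that the visual encoder stays frozen throughout is an assumption consistent with, but not stated in, the description:

```python
def set_trainable(model, stage):
    # Pretraining aligns visual and textual embeddings by training the
    # Q-Former alone; fine-tuning also unfreezes the Mamba backbone.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False  # assumed frozen in both stages
    for p in model.qformer.parameters():
        p.requires_grad = True   # trained in both stages
    for p in model.mamba_lm.parameters():
        p.requires_grad = (stage == "finetune")
```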
## Key Experimental Results
### Main Results
Zero-Shot Evaluation (Streaming Mode, Main Set):
| Model | IC-Acc ↑ | Prec. ↑ | Rec. ↑ | F1 ↑ | BERT ↑ | ROUGE-L ↑ |
|---|---|---|---|---|---|---|
| Gemini-2.5-Flash | 23.1 | 0.01 | 0.22 | 0.02 | 0.410 | 0.342 |
| Qwen2.5-VL-7B | 18.9 | 0.18 | 0.01 | 0.02 | 0.299 | 0.219 |
| VideoLLaMA3-7B | 1.8 | 0.00 | 0.00 | 0.00 | 0.000 | 0.000 |
Fine-Tuned Evaluation (Streaming Mode, Main Set):
| Model | IC-Acc ↑ | Prec. ↑ | Rec. ↑ | F1 ↑ | BERT ↑ | ROUGE-L ↑ |
|---|---|---|---|---|---|---|
| LiveMamba (Full) | 31.5 | 0.17 | 0.10 | 0.13 | 0.651 | 0.561 |
| LiveMamba (w/o-CFAug) | 14.3 | 0.12 | 0.03 | 0.05 | 0.558 | 0.511 |
| LiveMamba (w/o-ICAug) | 7.8 | 0.05 | 0.01 | 0.01 | 0.605 | 0.542 |
| VideoLLM-online† | 7.6 | 0.04 | 0.01 | 0.01 | 0.434 | 0.412 |
Turn-Based Evaluation (Main Set):
| Model | IC-Acc ↑ | F1 ↑ | BERT ↑ | ROUGE-L ↑ |
|---|---|---|---|---|
| LiveMamba† | 51.0 | 0.19 | 0.631 | 0.535 |
| Qwen2.5-VL-7B | 38.9 | 0.06 | 0.348 | 0.230 |
| Qwen2-VL-7B | 19.4 | 0.11 | 0.398 | 0.293 |
### Ablation Study
| Component | IC-Acc | Error F1 |
|---|---|---|
| Full LiveMamba | 31.5 | 0.13 |
| w/o ICAug | 7.8 | 0.01 |
| w/o CFAug | 14.3 | 0.05 |
| w/o Replanning (Adv Set) | 10.9 | 0.16 |
| w/ Replanning (Adv Set) | 12.6 | 0.19 |
### Key Findings
- All zero-shot MLLMs perform extremely poorly on streaming interactive guidance; the best-performing Gemini-2.5-Flash achieves only 23.1% IC-Acc.
- Counterfactual error augmentation raises error F1 from 0.05 to 0.13, demonstrating the critical importance of high-quality error data.
- LiveMamba, using a Mamba-130M backbone, processes frames at 4× the input rate in real time (8.1 fps vs. 2 fps input), with a latency of only 1.1 seconds.
## Highlights & Insights
- First real-time interactive guidance benchmark: Qualcomm Interactive Cooking fills the gap in step-by-step guidance evaluation for streaming video, comprising 94 hours of densely annotated data.
- Lightweight and efficient architecture: The Mamba-130M backbone makes the model suitable for edge deployment (smartphones, smart glasses).
- Counterfactual augmentation strategy: Automatically generated plausible error scenarios address the scarcity of error training data.
- Dual streaming/turn-based evaluation: Provides a comprehensive assessment perspective — streaming evaluation reflects real-world conditions, while turn-based evaluation facilitates per-step progress tracking.
## Limitations & Future Work
- The benchmark is limited to the cooking domain; generalization to other task settings has not been verified.
- Fine-grained error detection remains highly challenging (e.g., distinguishing 1 teaspoon vs. 1 tablespoon).
- Instruction completion accuracy on the Advanced Planning Set remains low (12.6%), indicating that complex plan reasoning requires further improvement.
- Replanning relies on an external large model (Qwen3-32B), incurring an average latency of 6.1 seconds, which limits real-time applicability.
## Related Work & Insights
- VideoLLM-online: The first online video dialogue framework, but it supports only narration, not interactive guidance.
- CaptainCook4D: Provides cooking videos containing user errors, offering the foundation for constructing an interactive benchmark.
- Mamba Architecture: The efficiency advantages of recurrent models for long-sequence inference are fully demonstrated in the streaming video setting.
- Key insight: Streaming interactive AI assistants require a "proactive speech" capability, which is fundamentally distinct from the conventional question-answering paradigm.
## Rating
- ⭐ Novelty: 4/5 — First systematic definition of real-time interactive guidance as a task; well-designed dataset and evaluation framework.
- ⭐ Experimental Thoroughness: 4/5 — Covers zero-shot, fine-tuned, ablation, and turn-based evaluations from multiple angles, though limited to a single domain.
- ⭐ Writing Quality: 4/5 — Problem definition is clear and experimental organization is well-structured.
- ⭐ Value: 4/5 — Opens a new research direction in streaming interactive guidance; the benchmark holds long-term value.