
DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Conference: NeurIPS 2025 · arXiv: 2505.18411 · Code: GitHub · Area: Multi-modal VLM
Keywords: Temporal Point Process, Danmaku, Multi-modal Benchmark, Large Language Models, Temporal Reasoning

TL;DR

This paper introduces DanmakuTPPBench, the first multi-modal Temporal Point Process (TPP) benchmark integrating temporal, textual, and visual modalities. It comprises DanmakuTPP-Events (7,250 video sequences with 10.8 million danmaku events collected from Bilibili) and DanmakuTPP-QA (10 evaluation tasks constructed via a multi-agent pipeline), revealing significant gaps in current LLM/MLLM capabilities for TPP understanding.

Background & Motivation

  1. Background: Temporal point processes (TPPs) have broad applications in social media, healthcare, and finance, yet existing TPP datasets are almost all unimodal (timestamps and event types only), impeding the development of multi-modal TPP models.

  2. Limitations of Prior Work: Existing datasets lack textual and visual context. The TPP comprehension capabilities of LLMs/MLLMs remain largely unexplored. No dedicated TPP question-answering benchmark exists.

  3. Key Challenge: Real-world event streams inherently contain multi-modal information, yet unimodal data cannot support the training and evaluation of models that exploit such information.

  4. Goal: Construct the first natively multi-modal TPP dataset and question-answering benchmark.

  5. Key Insight: The Bilibili danmaku system naturally constitutes a multi-modal TPP with precisely timestamped text content aligned to video frames.

  6. Core Idea: Danmaku serves as a natural multi-modal TPP data source; a multi-agent pipeline is employed to automatically construct the QA benchmark.

Method

Overall Architecture

The benchmark consists of two complementary components: (1) DanmakuTPP-Events, which provides data for conventional TPP modeling; and (2) DanmakuTPP-QA, which provides QA tasks for evaluating LLM/MLLM TPP comprehension. Data are collected from 7,250 videos by Bilibili's 2024 Top-100 creators across 14 video categories.

Key Designs

  1. DanmakuTPP-Events Dataset:
     • Function: The first multi-modal TPP modeling dataset.
     • Mechanism: Videos are collected from Bilibili's 2024 Top-100 creators; each danmaku event comprises a timestamp \(t_i\), an event type \(e_i\) (9 categories), a text token \(m_i^{text}\), and a video frame \(m_i^{image}\).
     • Design Motivation: Danmaku natively fuses temporal, textual, and visual modalities across 14 video categories.
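The per-event structure above can be sketched as a simple record type. This is an illustrative sketch only; the field names and storage format are hypothetical, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DanmakuEvent:
    """One danmaku event in a multi-modal TPP sequence (illustrative sketch).

    Field names are hypothetical; the actual dataset schema may differ.
    """
    t: float          # timestamp t_i (seconds into the video)
    event_type: int   # e_i, one of 9 danmaku event categories
    text: str         # m_i^text, the danmaku comment text
    frame_path: str   # m_i^image, path to the aligned video frame

# A toy two-event sequence, ordered by timestamp
seq = [
    DanmakuEvent(t=12.5, event_type=3, text="lol", frame_path="frames/0125.jpg"),
    DanmakuEvent(t=13.1, event_type=1, text="!!!", frame_path="frames/0131.jpg"),
]
inter_arrival = seq[1].t - seq[0].t  # gap between consecutive events
```

A TPP model consumes such a sequence in timestamp order; the multi-modal setting simply attaches the text and frame fields to each arrival.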

  2. Multi-Agent Construction Pipeline:
     • Function: Automated construction of high-quality QA data.
     • Mechanism: Five agents collaborate: a Task Design Agent (DeepSeek-R1 designs the 10 task types), an Annotation Agent (Qwen2.5 + Qwen2.5-VL + RAM for annotation), a Quality Control Agent (Qwen3 majority voting), a Visualization Agent (Qwen2.5-Coder generates charts), and a Task-Solving Agent (multi-LLM majority voting for answer generation).
     • Design Motivation: The scale and complexity of danmaku data render manual annotation infeasible; the multi-agent pipeline ensures quality at scale.
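The majority voting used by the quality-control and task-solving agents can be sketched as follows. This is a minimal illustration under assumed semantics; the paper's actual voting protocol and tie-breaking rule are not specified here:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among several model outputs.

    Ties are broken by first occurrence (Counter preserves insertion
    order); the benchmark's actual tie-breaking rule is an assumption.
    """
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner

# Three solver models vote on a closed-ended answer
votes = ["burst at 02:15", "burst at 02:15", "burst at 02:40"]
consensus = majority_vote(votes)  # → "burst at 02:15"
```

Voting over independent model outputs is what lets the pipeline filter out single-model annotation errors without human review.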

  3. 10 Evaluation Tasks:
     • Function: Comprehensive assessment of TPP comprehension.
     • Mechanism: 8 closed-ended tasks (danmaku burst counting, time prediction, sentiment prediction, event type reasoning, etc.) and 2 open-ended tasks (global sentiment dynamics analysis and danmaku burst causal analysis).
     • Design Motivation: Tasks span a range of difficulty levels, from simple prediction to complex multi-modal reasoning.

Training & Evaluation

  • Conventional TPP model evaluation: RMSE (time prediction) and log-likelihood (modeling performance).
  • QA evaluation: Accuracy/RMSE for closed-ended tasks; Qwen3-235B scoring (0–1) for open-ended tasks.
  • Fine-tuning: Qwen2.5-VL-3B + LoRA, single RTX 4090, 3 epochs.
  • Each MLLM samples only 3 video frames per evaluation; MLLM input includes the danmaku event sequence as text and the sampled video frames.
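The closed-ended time-prediction scoring can be illustrated with a minimal RMSE computation (illustrative only; the benchmark's exact aggregation and any normalization are not reproduced here):

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and true event times."""
    assert len(predicted) == len(actual) and predicted
    se = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(se / len(predicted))

# Toy next-event-time predictions (seconds) vs. ground truth
pred = [10.0, 22.0, 31.0]
true = [11.0, 20.0, 31.0]
err = rmse(pred, true)  # sqrt((1 + 4 + 0) / 3) ≈ 1.29
```

Lower is better, which is why T-2 results in the table below are marked RMSE ↓.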

Key Experimental Results

Main Results

Model            T-1 (ACC %)   T-2 (RMSE ↓)   T-7 (ACC %)   T-8 (ACC %)
Qwen2.5-7B       0.33          27.64          10.67         32.67
Qwen2.5-72B      0.67          1.28           16.00         43.83
DeepSeek-V3      25.00         1.30           13.67         34.50
Qwen2.5-VL-72B   0.33          1.14           15.98         47.17
Fine-tuned 3B    27.00         1.35           -             -

Key Findings

  • Among conventional TPP models, the Neural Hawkes Process (NHP) achieves the best performance (log-likelihood 0.799).
  • Scaling model size improves TPP comprehension: on time prediction (T-2), RMSE drops from 27.64 (Qwen2.5-7B) to 1.28 (Qwen2.5-72B).
  • Visual information (MLLMs) does not consistently improve performance, highlighting ongoing challenges in multi-modal fusion.
  • Danmaku burst counting (T-1) proves difficult for all models, with a maximum accuracy of only 27%.
  • Fine-tuning a 3B model can approach the performance of 72B models on certain tasks.
  • Across model families, Qwen3 performs best on sentiment-related tasks (lowest T-4 RMSE of 0.20), while DeepSeek-V3 and Llama-3.3 lead on sentiment polarity prediction (T-5/T-6).
  • MLLMs do not consistently outperform LLMs — Llama-3.3-70B achieves the lowest RMSE on T-2 (1.11), suggesting that language models can infer temporal patterns from linguistic cues.
  • The fine-tuned 3B model reduces error on sentiment prediction tasks (T-4/5/6) by 4–6× compared to the best pre-trained model (RMSE: 0.05/0.16/0.08), but overfits on T-3 (RMSE: 220.43).
  • On open-ended tasks, Qwen3-235B performs best on causal analysis (T-10, score 0.52), while Qwen2.5-VL-72B leads on global sentiment analysis (T-9, score 0.48).

Highlights & Insights

  • The choice of danmaku as a TPP data source is creative — it is natively multi-modal, large-scale, and rich in social signals.
  • The multi-agent pipeline constitutes a scalable paradigm for dataset construction.
  • The taxonomy of 9 danmaku event types has independent value for sociological research.
  • The benchmark exposes substantial gaps in LLMs/MLLMs' ability to understand temporal event sequences.
  • Fine-tuning Qwen2.5-VL-3B with LoRA on a single RTX 4090 for 3 epochs surpasses 72B pre-trained models on sentiment tasks, demonstrating the importance of task-specific adaptation.

Limitations & Future Work

  • Data are sourced exclusively from the Chinese-language Bilibili platform; cross-platform and cross-lingual generalization remains to be validated.
  • Each MLLM samples only 3 video frames; incorporating more frames may improve performance.
  • Danmaku data may contain inappropriate content, necessitating content moderation.
  • Conventional TPP models do not leverage multi-modal information, motivating the development of new multi-modal TPP architectures.
Comparison with Existing Datasets

  • vs. Retweet/StackOverflow datasets: These provide only timestamps and event types, lacking textual and visual information.
  • vs. Amazon Review: Textual content is available, but there is no visual modality; DanmakuTPP provides all three modalities.
  • vs. TSQA (Time-Series QA): TSQA targets general time series, whereas DanmakuTPP focuses specifically on event sequences.
  • vs. Language-TPP: Language-TPP applies LLMs to TPPs but uses only unimodal text data; DanmakuTPP introduces the first natively multi-modal TPP evaluation.

Implementation Details

Data are collected from 7,250 videos by Bilibili's 2024 Top-100 creators across 14 categories. Five agents collaborate: DeepSeek-R1 designs tasks, Qwen2.5 annotates, and Qwen3 controls quality. Fine-tuning employs Qwen2.5-VL-3B + LoRA on a single RTX 4090 GPU for 3 epochs.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First multi-modal TPP benchmark with an innovative danmaku data source.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation covering both conventional TPP models and LLMs/MLLMs.
  • Writing Quality: ⭐⭐⭐⭐ Detailed pipeline design with rich statistical analysis.
  • Value: ⭐⭐⭐⭐ Opens a new research direction for multi-modal TPP.