MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models

Conference: ACL 2026
arXiv: 2510.19457
Code: TBD (Not provided in the paper)
Area: Multimodal LMM / Knowledge Evaluation / Knowledge Editing / Time-Sensitive Knowledge
Keywords: time-sensitive knowledge, temporal awareness, benchmark, knowledge editing, LMM probing

TL;DR

The authors propose MINED, the first multimodal time-sensitive knowledge evaluation benchmark, containing 2104 \((subject, hypernym, property, attribute-list)\) quadruplets, across 6 dimensions (Cognition / Awareness / Trustworthiness / Understanding / Reasoning / Robustness) with 11 sub-tasks totaling 4208 questions. Evaluation of 15 LMMs shows that Gemini-2.5-Pro achieves the highest average CEM of 63.07 but still misses ~15% of knowledge. Furthermore, while knowledge editing methods like FT-LLM and IKE effectively update outdated knowledge in LLaVA-v1.5 and Qwen-VL under single editing, fine-tuning-based methods degrade significantly under lifelong editing (FT-LLM's average CEM drops by 43.2 points, from 97.2 to 54.0).

Background & Motivation

Background: LMMs (such as LLaVA-v1.5, Qwen2.5-VL, Gemini-2.5-Pro) encode vast amounts of factual knowledge through large-scale pre-training. However, parameters are static—once Messi transfers to Inter Miami CF, the model's answer to "Which team does Messi play for now?" becomes outdated. On the text side, benchmarks like TimeQA, TempReason, and EvolveBench evaluate temporal reasoning, but they primarily test temporal expressions or logic rather than whether the model's internal time-sensitive facts are up-to-date. In the multimodal domain, only LiveVQA and MMKU-Bench address real-time visual knowledge updates, lacking a systematic temporal awareness evaluation.

Limitations of Prior Work: (i) Existing multimodal benchmarks cover only a single dimension (cognition or reasoning) rather than a joint six-dimensional evaluation; (ii) No benchmarks explicitly test "how models behave when query time and external context time are misaligned (temporal misalignment)," "how to refuse unanswerable dates outside the temporal window," or "how to understand implicit temporal concepts" (e.g., "during Bezos's tenure as Amazon CEO"), which are common but overlooked in real-world deployment; (iii) Lack of corresponding evaluation protocols—fine-grained evaluations like CEM and Prompt Agreement have not been standardized for multimodal time-sensitive scenarios.

Key Challenge: There is an inevitable gap between the static parametric knowledge of LMMs and dynamic real-world facts. Current evaluations only indicate that a model is wrong without explaining why (e.g., is it a cognition failure, an inability to understand implicit time, or being misled by misaligned context?), rendering subsequent improvements aimless.

Goal: (a) Construct a multimodal time-sensitive knowledge benchmark spanning 6 domains (country / sport / company / university / organization / competition), 6 dimensions, and 11 sub-tasks; (b) Evaluate 15 SOTA LMMs to identify common weaknesses; (c) Verify if existing knowledge editing methods can effectively update time-sensitive knowledge in multimodal settings.

Key Insight: The authors abstract each piece of time-sensitive knowledge into a quadruplet \((S, H, P, A)\), where \(S\) is the subject (e.g., Lionel Messi), \(H\) is the hypernym (e.g., footballer), \(P\) is the property (e.g., plays for), and \(A = [a_1, \ldots, a_n]\) is the attribute-list (e.g., ["FC Barcelona | 2003-2021", "PSG | 2021-2023", "Inter Miami | 2023-now"]). These quadruplets are then transformed via templates into 11 sub-tasks (time-agnostic / interval-aware / timestamp-aware / unanswerable date / implicit concept / ranking / calculation / adversarial error, etc.).
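To make the quadruplet abstraction concrete, here is a minimal Python sketch of how an \((S, H, P, A)\) record could be stored and queried. The class names, the year-level granularity, the half-open interval convention, and the question wording are all assumptions made for illustration; the paper's actual schema and templates are not reproduced in this summary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Attribute:
    value: str
    start: int            # first year the value holds
    end: Optional[int]    # None encodes "now" (still valid)


@dataclass(frozen=True)
class Quadruplet:
    subject: str
    hypernym: str
    prop: str
    attributes: List[Attribute]

    def value_at(self, year: int) -> Optional[str]:
        """Value valid in `year`, or None when the year is outside the window."""
        for attr in self.attributes:
            # Half-open interval [start, end) is an assumption of this sketch.
            if attr.start <= year and (attr.end is None or year < attr.end):
                return attr.value
        return None


def parse_attribute(raw: str) -> Attribute:
    """Parse strings shaped like 'PSG | 2021-2023' or 'Inter Miami | 2023-now'."""
    value, span = (part.strip() for part in raw.split("|"))
    start, end = span.split("-")
    return Attribute(value, int(start), None if end == "now" else int(end))


def timestamp_aware_question(q: Quadruplet, year: int) -> str:
    """Hypothetical template; the subject is assumed to appear only in the image."""
    return f"In {year}, what is the '{q.prop}' value for the {q.hypernym} in the image?"


messi = Quadruplet(
    subject="Lionel Messi",
    hypernym="footballer",
    prop="plays for",
    attributes=[parse_attribute(raw) for raw in (
        "FC Barcelona | 2003-2021",
        "PSG | 2021-2023",
        "Inter Miami | 2023-now",
    )],
)

print(timestamp_aware_question(messi, 2022))
print(messi.value_at(2022))   # gold answer for the question above
print(messi.value_at(1999))   # None: a year before the attribute window
```

Because the gold answer is derived from the attribute list rather than stored per question, the same record can feed several sub-task templates, and a lookup that returns `None` marks a date that an unanswerable-date question could use.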

Core Idea: Decompose "time-sensitive knowledge capability" into a systematic "diagnostic panel" across six dimensions: cognition (recall) → awareness (context conflict detection) → trustworthiness (reject invalid time) → understanding (implicit time) → reasoning (rank/calc) → robustness (self-correct).

Method

Overall Architecture

The methodology covers four stages: benchmark construction, task data generation, model evaluation, and knowledge editing.

Benchmark Construction: ① Entity candidates are collected from Wikipedia across 6 domains by human annotators and GPT-4o → ② Two annotators manually filter for entities that are both visual and time-sensitive → ③ Quadruplets \((S, H, P, A)\) and original images are extracted → ④ Five perception templates are used to remove entities that "10 out of 15 LMMs cannot recognize", so that the remaining entities meet the visual perception standard → ⑤ Generalization images are crawled from Google and filtered by CLIP similarity, keeping the top-1 match.

Task Data: Question-answer pairs are generated for each quadruplet using 11 sub-task templates (4208 total, 2104 unique knowledge points × multiple prompt configurations).

Model Evaluation: 15 LMMs are tested across all sub-tasks using CEM scores and Prompt Agreement (averaging across 4 semantically equivalent prompts).

Knowledge Editing: LLaVA-v1.5 (7B) and Qwen-VL (7B) are selected as "outdated models" to test 5 editing methods (FT-LLM / FT-VIS / MEND / SERAC / IKE) under both single and lifelong settings.
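Steps ④ and ⑤ of benchmark construction can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that an entity is dropped when at least 10 of the 15 LMMs fail to recognize it, and that image embeddings have already been computed by a CLIP model elsewhere. The function names, the `min_sim` threshold, and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def keep_recognizable(recognized_by: Dict[str, int],
                      n_models: int = 15,
                      max_failures: int = 9) -> List[str]:
    """Keep entities that fewer than 10 of the 15 models fail to recognize."""
    return [entity for entity, hits in recognized_by.items()
            if n_models - hits <= max_failures]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_generalization_image(reference: Sequence[float],
                              candidates: Dict[str, Sequence[float]],
                              min_sim: float = 0.0) -> Optional[str]:
    """Return the candidate most similar to the reference image (top-1)."""
    scored = [(cosine(reference, emb), name) for name, emb in candidates.items()]
    scored = [pair for pair in scored if pair[0] >= min_sim]
    return max(scored)[1] if scored else None


# Toy data: number of models (out of 15) that recognized each entity.
kept = keep_recognizable({"entity_a": 14, "entity_b": 5, "entity_c": 6})
print(kept)

# Toy 2-d vectors standing in for CLIP image embeddings.
best = pick_generalization_image(
    reference=[1.0, 0.0],
    candidates={"crawl_1.jpg": [0.9, 0.1], "crawl_2.jpg": [0.0, 1.0]},
)
print(best)
```

The recognition filter runs before image crawling so that later errors can be attributed to missing temporal knowledge rather than to a failure to identify the entity in the image.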

Key Designs

  1. Systematic Evaluation Decomposition (6 Dimensions × 11 Tasks):

    • Function: Breaks down "time-sensitive knowledge understanding" into 6 independent diagnostic dimensions, each mapping to a specific failure mode in real-world LMM deployment.
    • Mechanism: Each dimension corresponds to a type of real-world problem. Cognition uses the T.A / T.I.A / T.S.A (time-agnostic / interval-aware / timestamp-aware) formats for recall: "Which club does the player currently play for?" is T.A, "From 2021 to 2023 he played for...?" is T.I.A, and "On 2024-01-01 he played for...?" is T.S.A. Awareness uses F.M.C / P.M.C (future / past misalignment context) to check whether models are misled when context time and query time are misaligned. Trustworthiness uses P.U.D / F.U.D (past / future unanswerable date) to test refusal of dates outside the attribute window. Understanding uses I.T.C (implicit temporal concept) to test parsing of implicit time. Reasoning is split into R.K (chronological ranking) and C.A (calculating days between events). Robustness uses A.T.E (adversarial temporal error) to check for self-correction when the model is told its answer is wrong.
    • Design Motivation: Previous benchmarks reported only overall accuracy. The six dimensions provide localizable failure modes—e.g., Obs 2 found that "small models are extremely fragile to past misalignment context" (Qwen2-VL 7B P.M.C dropped by 56.43%), a diagnostic signal invisible in single-metric evaluations.
  2. \((S, H, P, A)\) Abstraction + Prompt Agreement Protocol:

    • Function: Uses a unified schema to represent all time-sensitive knowledge and stabilizes evaluation scores via prompt averaging.
    • Mechanism: Every knowledge point is abstracted as a quadruplet \((S, H, P, A)\) rather than a natural language QA pair, allowing batch template generation for different sub-tasks. For evaluation, Cover Exact Match \(\text{CEM} = \mathbb{1}(\hat y \subseteq Y)\) is used instead of strict EM, which is better suited for free-form responses. Prompt Agreement averages scores across 4 semantically equivalent prompts to mitigate prompt phrasing artifacts.
    • Design Motivation: Quadruplet abstraction makes the benchmark extensible and evolvable (quarterly updates of \(A\) via Wikipedia). CEM is more appropriate for "fact retrieval" than BLEU/F1.
  3. Multimodal Knowledge Editing in Single vs. Lifelong Settings:

    • Function: Explores whether knowledge editing methods can truly update outdated time-sensitive knowledge in LMMs.
    • Mechanism: LLaVA-v1.5 (7B) and Qwen-VL (7B) are used as outdated models. Single editing restores weights after each edit to measure clean update effects. Lifelong editing evaluates the entire dataset after batch editing to measure cumulative interference. It compares parameter-modifying (FT-LLM, FT-VIS, MEND) and parameter-preserving (SERAC, IKE) methods.
    • Design Motivation: Single editing checks theoretical update capability, while lifelong editing checks real-world scalability. Results showed FT-LLM achieved 97.2% in single avg but dropped to 54.0% in lifelong, whereas SERAC remained stable, providing a roadmap for method selection.
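The difference between the single and lifelong protocols can be sketched as two evaluation loops. The sketch below is an assumption-laden illustration, not the paper's code: `apply_edit` and `answer` are hypothetical callables, lifelong editing is modeled as applying all edits in sequence to one model, and the toy "model" is a dictionary with two slots whose eviction behavior merely mimics interference between edits.

```python
import copy
from typing import Callable, Dict, List, Tuple

Edit = Tuple[str, str]          # (question, updated answer)
Model = Dict[str, str]          # toy stand-in for an editable LMM


def single_editing(model: Model, edits: List[Edit],
                   apply_edit: Callable[[Model, Edit], Model],
                   answer: Callable[[Model, str], str]) -> float:
    """Apply each edit to a fresh copy of the model, then test that edit."""
    hits = 0
    for question, target in edits:
        edited = apply_edit(copy.deepcopy(model), (question, target))
        hits += answer(edited, question) == target
    return hits / len(edits)


def lifelong_editing(model: Model, edits: List[Edit],
                     apply_edit: Callable[[Model, Edit], Model],
                     answer: Callable[[Model, str], str]) -> float:
    """Apply all edits to one model without resetting, then test every edit."""
    edited = copy.deepcopy(model)
    for edit in edits:
        edited = apply_edit(edited, edit)
    return sum(answer(edited, q) == t for q, t in edits) / len(edits)


def toy_apply_edit(model: Model, edit: Edit) -> Model:
    """Toy editor: stores the edit, but only the 2 most recent edits survive."""
    question, target = edit
    model[question] = target
    while len(model) > 2:
        del model[next(iter(model))]   # evict the oldest entry
    return model


def toy_answer(model: Model, question: str) -> str:
    return model.get(question, "")


edits = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
single_acc = single_editing({}, edits, toy_apply_edit, toy_answer)
lifelong_acc = lifelong_editing({}, edits, toy_apply_edit, toy_answer)
print(single_acc, lifelong_acc)
```

In the toy setup every edit succeeds in isolation, while only the most recent edits survive the sequential run, which is the kind of gap the single versus lifelong comparison is designed to expose.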

Loss & Training

This is primarily a benchmark paper; no new models were trained. For knowledge editing, the original losses were used (FT-LLM = standard CE fine-tuning, MEND = hypernetwork loss, SERAC = retrieval + counterfactual model loss). The primary metric is \(C_d = \frac{1}{N}\sum_{i=1}^{N} \text{CEM}_i\), where \(\text{CEM} = \mathbb{1}(\hat y \subseteq Y)\).
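A minimal Python sketch of the metric follows. It reads the indicator as "the gold answer is covered by the model's response", the usual meaning of cover exact match; this summary does not spell out the roles of the symbols in the formula, so that reading, the text normalization, and the function names are assumptions. Prompt Agreement is modeled as the mean of the per-prompt scores.

```python
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before matching (an assumed choice)."""
    return " ".join(text.lower().split())


def cem(response: str, gold: str) -> int:
    """1 if the gold answer appears inside the response, else 0."""
    return int(normalize(gold) in normalize(response))


def dataset_score(responses: List[str], golds: List[str]) -> float:
    """C_d: mean CEM over N questions, reported as a percentage."""
    values = [cem(r, g) for r, g in zip(responses, golds)]
    return 100.0 * sum(values) / len(values)


def prompt_agreement(responses_per_prompt: List[List[str]],
                     golds: List[str]) -> float:
    """Average the dataset score over semantically equivalent prompts."""
    scores = [dataset_score(responses, golds) for responses in responses_per_prompt]
    return sum(scores) / len(scores)


golds = ["Inter Miami", "PSG"]
responses_per_prompt = [
    ["He plays for Inter Miami.", "psg"],   # both covered
    ["Inter Miami CF", "Barcelona"],        # first covered only
    ["FC Barcelona", "PSG"],                # second covered only
    ["inter  miami", "Paris"],              # first covered after normalization
]
score = prompt_agreement(responses_per_prompt, golds)
print(score)
```

Substring coverage tolerates free-form answers such as full sentences, which strict exact match would mark as wrong.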

Key Experimental Results

Main Results

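CEM (%) for 5 representative models out of the 15 evaluated LMMs. Nine of the 11 sub-tasks are shown (the T.A and T.I.A columns are omitted), so Avg is not the mean of the displayed columns:

| Model | T.S.A (Cog) | F.M.C (Awa) | P.M.C (Awa) | P.U.D (Tru) | F.U.D (Tru) | I.T.C (Und) | R.K (Rea) | C.A (Rea) | A.T.E (Rob) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5 (7B) | 16.88 | 7.66 | 6.40 | 53.99 | 50.00 | 1.57 | 15.12 | 6.17 | 0.39 | 15.85 |
| Qwen2.5-VL (7B) | 41.67 | 40.04 | 33.98 | 99.64 | 99.76 | 4.02 | 38.89 | 25.00 | 16.86 | 39.55 |
| InternVL2.5 (8B) | 44.83 | 42.37 | 38.26 | 98.31 | 99.88 | 4.22 | 61.73 | 19.14 | 0.00 | 40.70 |
| GPT-4.1 | 80.91 | 78.07 | 77.49 | 65.22 | 91.30 | 8.63 | 15.74 | 59.57 | 17.58 | 51.82 |
| Gemini-2.5-Pro | 84.96 | 83.09 | 84.30 | 80.31 | 97.10 | 18.73 | 38.48 | 76.54 | 39.58 | 63.07 |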

→ Closed-source models lead significantly; I.T.C (implicit temporal concept) is a failure point for all models (max 18.73%); A.T.E (self-correction) is also a widespread weakness.
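As a sanity check on the table, the short Python sketch below compares the mean of the nine displayed columns with the reported Avg. It assumes (this summary does not state it) that Avg is the mean over all 11 sub-tasks, and under that assumption derives the combined score implied for the two sub-tasks not shown, T.A and T.I.A. The numbers are copied from the table; the variable names are illustrative.

```python
# Nine displayed sub-task scores and the reported Avg, copied from the table.
rows = {
    "LLaVA-v1.5 (7B)":  ([16.88, 7.66, 6.40, 53.99, 50.00, 1.57, 15.12, 6.17, 0.39], 15.85),
    "Qwen2.5-VL (7B)":  ([41.67, 40.04, 33.98, 99.64, 99.76, 4.02, 38.89, 25.00, 16.86], 39.55),
    "InternVL2.5 (8B)": ([44.83, 42.37, 38.26, 98.31, 99.88, 4.22, 61.73, 19.14, 0.00], 40.70),
    "GPT-4.1":          ([80.91, 78.07, 77.49, 65.22, 91.30, 8.63, 15.74, 59.57, 17.58], 51.82),
    "Gemini-2.5-Pro":   ([84.96, 83.09, 84.30, 80.31, 97.10, 18.73, 38.48, 76.54, 39.58], 63.07),
}

N_SUBTASKS = 11  # assumption: Avg is taken over all 11 sub-tasks

summary = {}
for model, (shown, reported_avg) in rows.items():
    mean_shown = sum(shown) / len(shown)
    # Combined score the two hidden columns would need for Avg to hold.
    implied_hidden_sum = reported_avg * N_SUBTASKS - sum(shown)
    summary[model] = (mean_shown, implied_hidden_sum)
    print(f"{model:18s} mean of shown = {mean_shown:6.2f}  "
          f"reported Avg = {reported_avg:6.2f}  "
          f"implied T.A + T.I.A = {implied_hidden_sum:6.2f}")
```

For every listed model the mean of the displayed columns is higher than the reported Avg, and the implied combined score for the hidden columns stays within the valid 0 to 200 range, which is consistent with an 11-task average.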

Ablation Study

Single vs. Lifelong knowledge editing (LLaVA-v1.5 7B, average CEM % across 9 tasks):

| Method | Single avg | Lifelong avg | Δ | Evaluation |
|---|---|---|---|---|
| FT-LLM | 97.2 | 54.0 | −43.2 | Strongest in single but collapses in lifelong |
| FT-VIS | 86.6 | 34.8 | −51.8 | Visual-only editing is less stable |
| MEND | 62.7 | N/A | N/A | Weak even in single editing |
| SERAC | 61.6 | 51.2 | −10.4 | Mediocre in single but stable in lifelong |
| IKE | 76.0 | N/A | N/A | In-context methods perform well in single |

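→ In lifelong editing, SERAC loses only 10.4 points, a drop roughly 4 to 5 times smaller than that of the fine-tuning methods (FT-LLM −43.2, FT-VIS −51.8), because its memory-based architecture avoids catastrophic forgetting. In absolute terms, however, FT-LLM's lifelong average (54.0) is still slightly above SERAC's (51.2).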
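The robustness comparison can be reproduced from the table with a few lines of Python. The sketch uses only the three methods that have both a single and a lifelong average; the "retention" ratio (lifelong divided by single) is an extra illustrative quantity, not a metric reported in this summary.

```python
# (single avg, lifelong avg) in CEM %, copied from the table above.
results = {
    "FT-LLM": (97.2, 54.0),
    "FT-VIS": (86.6, 34.8),
    "SERAC":  (61.6, 51.2),
}

# Absolute drop in points when moving from single to lifelong editing.
drops = {m: round(single - lifelong, 1) for m, (single, lifelong) in results.items()}

# Share of single-editing performance that survives lifelong editing.
retention = {m: lifelong / single for m, (single, lifelong) in results.items()}

# How many times larger each drop is than SERAC's.
ratio_to_serac = {m: drops[m] / drops["SERAC"] for m in results}

for m in results:
    print(f"{m:7s} drop = {drops[m]:5.1f} pts  "
          f"retention = {retention[m]:.1%}  "
          f"drop vs SERAC = {ratio_to_serac[m]:.2f}x")
```

The drop ratio depends on which fine-tuning method serves as the reference, so a single "4×" figure is best read as an approximation.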

Key Findings

  • Obs 1: Timestamp-Aware > Interval-Aware > Time-Agnostic: LMMs perform best on specific timestamp queries, suggesting internal knowledge is indexed point-in-time.
  • Obs 2: Small models are extremely fragile to past misalignment context: Qwen2-VL (7B) dropped 56.43% in P.M.C, whereas larger/closed models are more robust.
  • Obs 3: Rejecting future dates is more accurate than past dates: Future dates are "unseen concepts," leading to higher refusal confidence (~99% for Qwen2-VL).
  • Obs 5: Scaling up doesn't necessarily improve ranking: Qwen2.5-VL ranking accuracy dropped as model size increased from 3B (50.3) → 7B (38.9) → 72B (11.4), suggesting over-thinking.
  • Obs 7: Newer models have stronger temporal awareness: Release date correlates positively with Avg CEM.

Highlights & Insights

  • Six-dimensional diagnostic panel enables failure classification: Breaking down capabilities maps to specific deployment pain points (e.g., context conflict in RAG, refusal-to-answer in customer service).
  • Quadruplet + Quarterly updates = Evolvable benchmark: Abstracting knowledge into \((S, H, P, A)\) with a Wikipedia pipeline makes MINED a "living" benchmark, serving as data infrastructure.
  • I.T.C collapse serves as a wake-up call: The failure of even SOTA models on implicit temporal concepts (e.g., "during Bezos's tenure") indicates LMMs struggle with two-step reasoning involving temporal grounding.
  • Single vs. Lifelong editing comparison offers actionable value: The engineering conclusion that "FT-LLM is for small-scale updates, SERAC for lifelong" is highly practical for LMM-as-database scenarios.

Limitations & Future Work

  • The benchmark only covers 6 domains; critical areas like law or medicine are missing. Visual data is limited to static images.
  • The "evolvable" quarterly update pipeline has not yet been demonstrated over real multiple-cycle comparisons.
  • Knowledge editing experiments were conducted on older models (LLaVA-v1.5, Qwen-VL); conclusions may not fully generalize to the newest architectures.
  • I.T.C tasks rely on manually curated implicit time mappings, limiting sample size and generalizability.

Related Work

  • vs. EvolveBench: EvolveBench measures text-side cognition and awareness; MINED extends the evaluation to the multimodal setting with six dimensions and adds tasks such as implicit temporal concepts and adversarial robustness.
  • vs. LiveVQA / MMKU-Bench: MINED focuses on robustness against temporal misalignment, which LiveVQA overlooks.
  • vs. TimeQA / TempReason: These text benchmarks focus on temporal expressions and logic, whereas MINED focuses on whether factual knowledge at specific time points is up-to-date.

Rating

  • Novelty: ⭐⭐⭐⭐ (First multimodal time-sensitive benchmark with 6-dim decomposition).
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (15 LMMs, multiple sub-tasks, and extensive knowledge editing tests).
  • Writing Quality: ⭐⭐⭐⭐ (Clear logic and actionable takeaways).
  • Value: ⭐⭐⭐⭐⭐ (Long-term value as an evolvable evaluation infrastructure).