Evaluating Memory Capability in Continuous Lifelog Scenario¶
Conference: ACL 2026 · arXiv: 2604.11182 · Code: https://github.com/RayNeo-AI-2025/LifeDialBench · Area: Dialogue & Memory · Keywords: lifelog memory, online evaluation, wearable devices, RAG baseline, long-term dialogue
TL;DR¶
This paper proposes LifeDialBench, a benchmark for evaluating memory capabilities in continuous lifelog scenarios, comprising EgoMem (7 days of real-world data) and LifeMem (1 year of simulated data). An online evaluation protocol is introduced to enforce temporal causality. Counterintuitively, a simple RAG baseline consistently outperforms all complex memory systems.
Background & Motivation¶
Background: Wearable devices such as Ray-Ban Meta smart glasses and Xiaomi AI Glasses now support always-on microphones capable of continuously recording ambient conversations, creating substantial opportunities for memory system applications. LLM-based memory systems typically consist of a memory manager, summarization agent, and retriever.
Limitations of Prior Work: Existing memory benchmarks focus primarily on online one-on-one chat or human–AI interactions, overlooking the unique demands of continuous lifelogs—multi-party interactions, casual and temporally ordered event streams, and simulated social networks. More critically, conventional offline evaluation protocols suffer from "temporal leakage," whereby systems are allowed to access the full dataset before answering any query, systematically overestimating real-world performance.
Key Challenge: Existing complex memory systems (e.g., graph-based or hierarchical approaches) introduce lossy compression through summarization and entity extraction, which may discard details that are critical in lifelog scenarios. However, due to the absence of rigorous online evaluation protocols, this information loss is obscured by temporal leakage in offline evaluation.
Goal: (1) Construct a memory evaluation benchmark suited to continuous lifelog characteristics; (2) propose an online evaluation protocol that respects temporal causality; (3) reveal the true capabilities of existing memory systems.
Key Insight: Real-world egocentric video data from the EgoLife dataset (6 participants, 7 days) is used to construct authentic scenarios, while LLM-based simulation of one year of daily life extends the temporal scope. A strict online evaluation paradigm is adopted, in which information flows linearly along the timeline and the system may only use information available prior to each query timestamp.
Core Idea: Evaluating memory systems under strict temporal causality constraints reveals a counterintuitive finding—a simple RAG baseline outperforms all sophisticated dedicated memory systems, because preserving raw text is more important than lossy compression.
Method¶
Overall Architecture¶
LifeDialBench consists of two complementary subsets: (1) EgoMem—constructed from the real-world EgoLife dataset (6 participants, 7 days) via bottom-up hierarchical summarization; and (2) LifeMem—generated by simulating one year of daily life using LLMs via top-down hierarchical expansion. Both subsets generate QA pairs from multi-level event summaries and support the online evaluation protocol.
Key Designs¶
- Hierarchical Life Simulation Framework:
- Function: Generates multi-party continuous conversation logs with long temporal spans and diverse scenarios.
- Mechanism: EgoMem follows a bottom-up approach—progressing from second-level video clips → minute-level summaries → hour-level → day-level → week-level summaries. LifeMem follows a top-down approach—an LLM first designs an annual outline → monthly plans → daily events → specific dialogues, simulating a year of life with multi-party interactions. Qwen3-235B-Instruct is used for all dialogue and summary generation.
- Design Motivation: EgoMem provides real-world grounding (7 days suffices for proof-of-concept), while LifeMem provides long temporal span and scenario diversity (1 year); the two are complementary.
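The bottom-up aggregation used for EgoMem can be sketched as a simple roll-up over timestamped logs. This is a minimal illustration, not the paper's pipeline: the `summarize` placeholder just joins and tags its inputs, where the actual system calls Qwen3-235B-Instruct, and the `(day, hour, minute)` key schema is an assumption.

```python
from collections import defaultdict

def summarize(texts, level):
    """Placeholder summarizer: the paper uses an LLM here; we just
    join the inputs and tag them with the aggregation level."""
    return f"[{level}] " + " | ".join(texts)

def bottom_up(minute_logs):
    """Roll minute-level entries up into hour- and day-level summaries.
    `minute_logs` maps (day, hour, minute) -> text (hypothetical schema,
    mirroring EgoMem's minute -> hour -> day aggregation)."""
    hours = defaultdict(list)
    for (day, hour, _minute), text in sorted(minute_logs.items()):
        hours[(day, hour)].append(text)
    hour_summaries = {k: summarize(v, "hour") for k, v in hours.items()}

    days = defaultdict(list)
    for (day, _hour), summ in sorted(hour_summaries.items()):
        days[day].append(summ)
    day_summaries = {d: summarize(v, "day") for d, v in days.items()}
    return hour_summaries, day_summaries
```

LifeMem's top-down expansion is the mirror image: start from an annual outline and recursively expand each node into finer-grained plans, events, and dialogues.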
- Online Evaluation Protocol:
- Function: Eliminates temporal leakage in offline evaluation, ensuring assessment reflects real-world deployment conditions.
- Mechanism: The protocol strictly enforces temporal linearity—the system starts from an empty state and receives conversational data incrementally in chronological order. At each evaluation point associated with a query timestamp, the system may only use information stored prior to that timestamp. Information is updated incrementally, and evaluation is performed intermittently during data ingestion rather than only after all data has been stored.
- Design Motivation: Conventional offline evaluation grants systems an "oracle" perspective—allowing information from December to be consulted when answering a query about February. The online protocol eliminates this unfair advantage and simulates realistic deployment conditions.
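The protocol above reduces to a single chronological pass over the data stream. The sketch below is an assumed harness, not the paper's actual API: `stream`, `queries`, and `answer_fn` are hypothetical interfaces that capture the core constraint, namely that a query at timestamp `t` sees only memory accumulated strictly before `t`.

```python
def online_evaluate(stream, queries, memory, answer_fn):
    """Online protocol sketch.  `stream` is a chronologically sorted list
    of (timestamp, utterance) pairs; `queries` maps a timestamp to the
    question issued at that moment; `memory` starts empty; and
    `answer_fn(memory, question)` is the memory system under test."""
    answers = {}
    for ts, utterance in stream:
        # Answer any query scheduled at this timestamp BEFORE ingesting
        # the current utterance, so no same-instant leakage occurs.
        if ts in queries:
            answers[ts] = answer_fn(memory, queries[ts])
        memory.append((ts, utterance))  # incremental update, no lookahead
    return answers
```

An offline protocol, by contrast, would ingest the entire stream first and only then answer every query, which is exactly the "oracle" perspective the paper eliminates.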
- Multi-Dimensional Query Design:
- Function: Comprehensively probes memory retrieval capabilities across different types and granularities.
- Mechanism: Three query categories are designed: (a) temporal localization—determining when an event occurred; (b) factual retrieval—recalling specific details; (c) compositional reasoning—cross-event association and inference. QA pairs are generated from multi-level event summaries to ensure coverage of memory demands at different temporal granularities.
- Design Motivation: Lifelog queries extend far beyond simple fact retrieval, requiring the integration of temporal reasoning, cross-event association, and detail recall.
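The three query categories could be represented with a record layout along these lines. Field and enum names here are illustrative assumptions, not the benchmark's released data format:

```python
from dataclasses import dataclass
from enum import Enum

class QueryType(Enum):
    # The three categories probed by the benchmark.
    TEMPORAL = "temporal_localization"   # when did it happen?
    FACTUAL = "factual_retrieval"        # what were the details?
    COMPOSITIONAL = "compositional"      # cross-event association/inference

@dataclass
class QAPair:
    """Hypothetical schema for one benchmark item."""
    question: str
    answer: str
    qtype: QueryType
    query_time: str     # timestamp at which the query is issued (online protocol)
    source_level: str   # summary granularity it was generated from, e.g. "hour", "day"
```

Carrying `query_time` on each item is what lets the online protocol schedule queries mid-stream, and `source_level` ties each question back to the summary tier that produced it.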
Loss & Training¶
Because this is a benchmark paper, no model training is involved. Four representative categories of memory systems are evaluated: a simple RAG baseline, summarization-based compression methods, graph-structured methods, and hierarchical memory methods.
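The "simple RAG baseline" amounts to storing raw dialogue chunks verbatim and retrieving the top-k most similar ones at query time. The sketch below is illustrative only: the paper's baseline presumably uses dense embeddings, whereas this variant uses bag-of-words cosine similarity to keep the example self-contained, and the class name `SimpleRAG` is an assumption.

```python
from collections import Counter
from math import sqrt

class SimpleRAG:
    """Minimal RAG sketch: raw text is stored with no summarization or
    entity extraction, and retrieval is lexical cosine similarity."""

    def __init__(self):
        self.chunks = []  # list of (raw_text, term_counts)

    def add(self, text):
        # Store the chunk verbatim -- no lossy compression.
        self.chunks.append((text, Counter(text.lower().split())))

    def retrieve(self, query, k=3):
        q = Counter(query.lower().split())
        q_norm = sqrt(sum(v * v for v in q.values()))

        def cos(c):
            dot = sum(q[w] * c[w] for w in q)
            norm = q_norm * sqrt(sum(v * v for v in c.values()))
            return dot / norm if norm else 0.0

        ranked = sorted(self.chunks, key=lambda tc: cos(tc[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The paper's central finding is that even a baseline this plain beats systems that summarize or restructure the stream, precisely because nothing is discarded at ingestion time.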
Key Experimental Results¶
Main Results¶
| Memory System | EgoMem | LifeMem | Notes |
|---|---|---|---|
| Simple RAG | Highest | Highest | Retrieves raw text directly |
| Summarization-based compression | Below RAG | Below RAG | Lossy compression discards details |
| Graph-structured method | Below RAG | Below RAG | Over-engineering is detrimental |
| Hierarchical memory method | Below RAG | Below RAG | Complex structure, inferior performance |
Ablation Study¶
| Evaluation Mode | Performance Difference | Notes |
|---|---|---|
| Online evaluation | All systems score lower | Performance drops after eliminating temporal leakage |
| Offline evaluation | Universally inflated | Temporal leakage present |
| Online vs. offline ranking change | Ranking reversals observed | Offline evaluation may misrank systems |
Key Findings¶
- Counterintuitive result: The simple RAG baseline consistently outperforms all complex memory systems, including advanced graph-structured and hierarchical approaches.
- Lossy compression (summarization, entity extraction) is more harmful than beneficial in lifelog scenarios—preserving detailed information outweighs structured abstraction.
- Temporal retrieval is a universal bottleneck across all methods—"when did it happen" is harder to answer than "what happened."
- Online evaluation reveals true capability gaps masked by offline evaluation—certain systems that perform well offline degrade significantly under online conditions.
- The design direction of current memory systems may reflect a fundamental misjudgment—high-fidelity context preservation is more important than intelligent compression.
Highlights & Insights¶
- Importance of the online evaluation protocol: The temporal leakage problem in offline evaluation is exposed, with broad implications for all temporally structured AI evaluation. Similar information leakage issues may exist in many NLP benchmarks.
- Counterintuitive finding that simplicity works: Carefully designed complex memory systems underperform a simple RAG baseline, suggesting that data fidelity is currently more important than structured abstraction.
- Forward-looking lifelog scenario: As smart glasses and similar devices proliferate, continuous lifelogs will become a critical AI application domain; this benchmark provides the necessary evaluation infrastructure for this direction.
Limitations & Future Work¶
- LifeMem dialogues are synthesized by LLMs and may not fully capture the randomness and messiness of real conversations.
- EgoMem covers only 7 days and 6 participants, limiting temporal and demographic diversity.
- Simple RAG may face retrieval efficiency challenges as data volume grows substantially (e.g., years of logs).
- Multimodal memory (e.g., memory incorporating visual information) is not evaluated.
Related Work & Insights¶
- vs. LoCoMo: Focuses on human–human dialogue but not continuous recording, and lacks online evaluation. LifeDialBench is more aligned with real-world scenarios.
- vs. LongMemEval: Covers human–AI interaction with up to 50K sessions but lacks multi-party and continuous characteristics.
- vs. MemBank: Covers 10 days of human–AI interaction with limited scale and scenario diversity. LifeDialBench covers one year of multi-party scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Both the online evaluation protocol and the counterintuitive findings represent important contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple memory systems, two subsets, and online/offline comparison.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear; discussion of counterintuitive findings is in-depth.
- Value: ⭐⭐⭐⭐⭐ Identifies a fundamental design flaw in current memory systems with broad impact.