
REMem: Reasoning with Episodic Memory in Language Agents

Conference: ICLR 2026 | arXiv: 2602.13530 | Code: intuit-ai-research/REMem | Area: LLM Agent
Keywords: episodic memory, language agent, hybrid memory graph, temporal reasoning, agentic retrieval, gist extraction

TL;DR

This paper proposes REMem, an episodic memory framework for language agents that employs a hybrid memory graph (temporally-aware gist nodes combined with factual triple nodes) and tool-augmented agentic reasoning, achieving improvements of 3.4% and 13.4% over the state of the art on episodic recall and episodic reasoning tasks, respectively.

Background & Motivation

Humans excel at recalling concrete experiences and reasoning within spatiotemporal contexts—a capability known as episodic memory—yet the memory systems of current language agents suffer from significant deficiencies:

Dominance of semantic memory: Existing systems (parametric memory, RAG, GraphRAG) primarily store decontextualized semantic knowledge, lacking temporal and spatial dimensions.

Absence of event modeling: Mem0 over-filters information, causing loss of details; Graphiti constructs entity-centric knowledge graphs that discard coherent event context; HippoRAG 2 lacks temporal dimension modeling.

Retrieval insufficient for reasoning: Existing approaches rely on simple similarity matching, which cannot support complex cross-event reasoning such as temporal range filtering, event ordering, and counting queries.

The core design philosophy of REMem draws on two insights:

  • Cognitive science indicates that humans rely more on gist-based representations than on verbatim memory for decision-making.
  • Contextual dimensions such as time, location, and participants must be explicitly bound to event representations.

Method

Overall Architecture

REMem operates in two stages:

  1. Indexing stage: converts experiences into a hybrid memory graph.
  2. Agentic reasoning stage: iteratively retrieves and reasons over the memory graph using a carefully designed set of tools.

Key Designs: Indexing Stage

1) Gist extraction:

  • For each event or dialogue session, one or more natural-language gist statements are generated.
  • Each gist is annotated with a timestamp (reference time), with relative temporal expressions converted to absolute dates.
  • Gists capture core information including participants, actions, objects, locations, intentions, and quantities.
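The conversion of relative temporal expressions to absolute dates can be sketched as follows. This is a minimal illustration, not the paper's extraction pipeline, which delegates the conversion to an LLM; `resolve_relative_date` is a hypothetical helper:

```python
from datetime import date, timedelta

def resolve_relative_date(expression: str, reference: date) -> date:
    """Map a relative temporal expression to an absolute date,
    given the session's reference time (hypothetical helper)."""
    expression = expression.lower().strip()
    if expression == "today":
        return reference
    if expression == "yesterday":
        return reference - timedelta(days=1)
    if expression == "tomorrow":
        return reference + timedelta(days=1)
    if expression.startswith("last "):
        # crude approximation: "last week" -> 7 days back, etc.
        unit = expression.split()[1]
        days = {"week": 7, "month": 30, "year": 365}.get(unit, 0)
        return reference - timedelta(days=days)
    raise ValueError(f"unsupported expression: {expression!r}")

ref = date(2026, 2, 13)
print(resolve_relative_date("yesterday", ref))  # → 2026-02-12
```

In practice the LLM performs this normalization during gist generation, so edge cases (ambiguous weekdays, time zones) are handled contextually rather than by rule.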

2) Fact extraction:

  • Structured (subject, predicate, object) triples are extracted from both raw text and gists.
  • Each triple is augmented with Wikidata-style temporal qualifiers: point_in_time, start_time, and end_time.
  • Potentially contradictory historical records are retained to support temporal retrospective queries.
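A minimal sketch of such temporally qualified triples, assuming ISO-date strings for the qualifiers (the `Fact` class and `valid_at` helper are illustrative, not the paper's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fact:
    """A (subject, predicate, object) triple with Wikidata-style
    temporal qualifiers, as described in the indexing stage."""
    subject: str
    predicate: str
    object: str
    point_in_time: Optional[str] = None  # ISO date, e.g. "2025-06-01"
    start_time: Optional[str] = None
    end_time: Optional[str] = None

# Contradictory historical records are kept side by side, so a
# temporal retrospective query can pick the record valid at a date.
facts = [
    Fact("Alice", "works_at", "Acme", start_time="2023-01-01", end_time="2024-06-30"),
    Fact("Alice", "works_at", "Globex", start_time="2024-07-01"),
]

def valid_at(fact: Fact, day: str) -> bool:
    """Check whether a fact's validity interval covers a given ISO date."""
    if fact.point_in_time:
        return fact.point_in_time == day
    after_start = fact.start_time is None or fact.start_time <= day
    before_end = fact.end_time is None or day <= fact.end_time
    return after_start and before_end

print([f.object for f in facts if valid_at(f, "2024-03-15")])  # → ['Acme']
```

ISO-formatted dates compare correctly as strings, which is why `valid_at` needs no date parsing.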

3) Graph construction:

  • Gist nodes: context-level episodic representations, linked to phrase nodes from the same chunk.
  • Phrase nodes: concept-level representations; subject/object nodes are directly connected via predicate edges.
  • Synonymy edges: added between gist nodes whose embedding similarity exceeds a threshold of 0.8.
  • The result is a hybrid memory graph: a unified representation integrating concept-level and context-level information.
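The synonymy-edge step can be sketched in a few lines. The 0.8 threshold is from the paper; the toy three-dimensional "embeddings" are placeholders for real embedding vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# toy gist nodes with illustrative (not real) embeddings
gists = {
    "g1": [0.9, 0.1, 0.0],
    "g2": [0.85, 0.2, 0.05],
    "g3": [0.0, 0.1, 0.95],
}

SYNONYMY_THRESHOLD = 0.8  # threshold reported in the paper

# connect every pair of gist nodes whose similarity exceeds the threshold
synonymy_edges = [
    (a, b)
    for i, a in enumerate(gists)
    for b in list(gists)[i + 1:]
    if cosine(gists[a], gists[b]) > SYNONYMY_THRESHOLD
]
print(synonymy_edges)  # → [('g1', 'g2')]
```

At scale the all-pairs comparison would be replaced by an approximate nearest-neighbor index, but the edge semantics are the same.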

Key Designs: Agentic Reasoning Stage

A ReAct-style agent is employed, equipped with three categories of carefully designed tools:

Tool Category       Tool Name             Key Parameters
Retrieval           semantic_retrieve     query, start_time, end_time, temporal operator
Retrieval           lexical_retrieve      query, start_time, end_time, temporal operator
Graph exploration   find_gist_contexts    gist_id, time range
Graph exploration   find_entity_contexts  subject, object, predicate, time range, limit, ordering, offset, aggregation
Flow control        output_answer         answer

Reasoning follows a three-phase protocol:

  1. Retrieval: semantic/lexical retrieval to obtain seed nodes and preliminary evidence.
  2. Graph exploration: directed traversal from seed nodes to acquire event-level narratives and temporally anchored evidence.
  3. Flow control: output the final answer once sufficient evidence has been collected.
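The three-phase protocol reduces to a standard ReAct loop. The sketch below is a minimal stand-in: `llm_step` and the tool bodies are hypothetical placeholders for the LLM call and the real retrieval/graph tools:

```python
def run_agent(question, llm_step, tools, max_steps=10):
    """ReAct-style loop: the LLM picks a tool each turn until it
    calls output_answer or the step budget runs out."""
    scratchpad = [("question", question)]
    for _ in range(max_steps):
        tool_name, args = llm_step(scratchpad)  # LLM decides the next phase
        if tool_name == "output_answer":        # phase 3: flow control
            return args["answer"]
        observation = tools[tool_name](**args)  # phases 1-2: retrieve/explore
        scratchpad.append((tool_name, observation))
    return None  # budget exhausted without an answer

# toy driver: retrieve once, then answer
def fake_llm(scratchpad):
    if len(scratchpad) == 1:
        return "semantic_retrieve", {"query": "trip to Paris"}
    return "output_answer", {"answer": "June 2024"}

tools = {"semantic_retrieve": lambda query: [f"gist matching {query!r}"]}
print(run_agent("When did Alice visit Paris?", fake_llm, tools))  # → June 2024
```

The `max_steps` budget matters in practice: the paper notes that multi-step tool invocation is the main driver of REMem-I's latency and cost.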

find_entity_contexts supports logical operations including temporal filtering, ordering, and aggregation, far surpassing simple similarity matching.
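A hypothetical sketch of what such logical operations could look like over toy event records. The parameter names mirror the tool table above, but the function body is illustrative, not the paper's implementation:

```python
def find_entity_contexts(records, subject=None, predicate=None,
                         start_time=None, end_time=None,
                         ordering="asc", limit=None, offset=0,
                         aggregation=None):
    """Filter event records by entity and time range, then order,
    paginate, and optionally aggregate (illustrative sketch)."""
    hits = [r for r in records
            if (subject is None or r["subject"] == subject)
            and (predicate is None or r["predicate"] == predicate)
            and (start_time is None or r["time"] >= start_time)
            and (end_time is None or r["time"] <= end_time)]
    hits.sort(key=lambda r: r["time"], reverse=(ordering == "desc"))
    hits = hits[offset: offset + limit if limit else None]
    if aggregation == "count":
        return len(hits)
    return hits

records = [
    {"subject": "Alice", "predicate": "visited", "object": "Paris", "time": "2024-06-10"},
    {"subject": "Alice", "predicate": "visited", "object": "Rome",  "time": "2023-09-02"},
    {"subject": "Alice", "predicate": "visited", "object": "Tokyo", "time": "2024-11-20"},
]

# counting query with a temporal range filter
print(find_entity_contexts(records, subject="Alice", predicate="visited",
                           start_time="2024-01-01", aggregation="count"))  # → 2
```

This is exactly the class of query (temporal filtering plus counting) that plain similarity search cannot answer reliably.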

Loss & Training

REMem does not involve model training. The indexing stage uses an LLM for gist and fact extraction; the reasoning stage employs an LLM agent for tool invocation and inference.

Key Experimental Results

Main Results — Episodic Recall

Method              LoCoMo (LLM-J)   REALTALK (LLM-J)
NV-Embed-v2 (RAG)   73.0             59.5
Mem0                49.7             14.3
Graphiti            52.5             35.3
HippoRAG 2          74.0             55.8
REMem-S             77.5             65.3
REMem-I             76.2             63.7

REMem-S achieves an LLM-J score of 77.5% on LoCoMo (+3.5 vs. HippoRAG 2) and 65.3% on REALTALK (+9.5 vs. HippoRAG 2).

Main Results — Episodic Reasoning

Method                Complex-TR (LLM-J)   Test of Time (EM)
NV-Embed-v2 (RAG)     80.4                 68.9
NV-Embed-v2 + TISER   88.3                 68.9
HippoRAG 2            81.5                 66.9
REMem-I               89.6                 93.1
REMem-I + TISER       92.0                 90.6

REMem-I achieves an EM of 93.1% on Test of Time, making it the only method to surpass 90%. Compared to Full-Context (79.7%), this represents a gain of +13.4 pp.

Ablation Study

Variant                 LoCoMo (LLM-J)   Complex-TR (LLM-J)
REMem-I (full)          76.2             89.6
w/o Gists               48.9             80.9
w/o Facts               74.1             87.2
w/o synonymy edges      76.4             89.2
w/o semantic_retrieve   72.8             88.1
w/o lexical_retrieve    76.8             87.5
  • Removing gists has the largest impact: LLM-J on LoCoMo drops from 76.2 to 48.9 (−27.3), confirming that gists are the central carrier of episodic memory.
  • Facts are more critical for reasoning: Removing facts causes a −2.4 drop on Complex-TR.
  • The two retrieval tools are complementary: Semantic retrieval supports conceptual association, while lexical retrieval improves surface-form coverage.

Key Findings

  1. Abstention behavior: REMem achieves F1 = 64.0% (Precision 73.3%) on unanswerable questions, substantially outperforming Graphiti (F1 53.1%) and Mem0 (F1 13.5%), exhibiting more accurate and balanced abstention.
  2. Token efficiency: REMem-I consumes an average of 9K input tokens per query (REMem-S: 0.9K), compared to 26K for Full-Context.
  3. Human evaluation: LLM judges agree with human ratings 93% of the time, validating the reliability of the LLM-as-judge evaluation scheme.
  4. Error analysis: The primary error categories are selection/localization errors (46%), temporal/numerical reasoning errors (19%), and abstentions despite available evidence (18%).

Highlights & Insights

  1. Cognitive science–driven design: The framework is grounded in gist-based memory theory and situation model theory, engineering psychological concepts into practical components.
  2. Flexibility of the hybrid graph structure: The unified representation combining concept-level (fact triples) and context-level (gist) information supports both fine-grained and holistic understanding.
  3. Only method exceeding 90% EM: REMem-I is the sole method surpassing 90% on the Test of Time benchmark, demonstrating strong temporal reasoning capability.
  4. Divergence between iterative reasoning and single-step retrieval: REMem-I substantially outperforms REMem-S on reasoning tasks (EM 93.1 vs. 72.5), while the gap is smaller on recall tasks.
  5. Carefully designed tool interfaces: find_entity_contexts supports temporal filtering, ordering, offset, and aggregation operations, serving as a critical enabler of reasoning capability.

Limitations & Future Work

  1. The indexing stage relies on LLM-based extraction; the quality of gists and facts is therefore constrained by LLM capability.
  2. The offline batch indexing paradigm makes streaming memory construction an engineering challenge.
  3. Multi-step tool invocation in agentic reasoning increases inference latency and cost.
  4. Experiments primarily use GPT-4.1-mini; generalizability to other models has not been thoroughly validated.
  5. Ablations show that synonymy edges have limited impact (LLM-J change of only −0.2 to −0.4), warranting re-examination of their cost-benefit ratio.
Comparison with Related Work

  • Relationship to HippoRAG: HippoRAG constructs knowledge graphs inspired by the hippocampus for associative retrieval but lacks temporal and event dimensions; REMem explicitly models temporal timelines and contextual dimensions.
  • Relationship to Mem0: Mem0's aggressive filtering results in sparse memory; REMem retains comprehensive gist and fact records.
  • Relationship to TISER: TISER is a pure prompting method that guides temporal reasoning and can be used in complementary combination with REMem (+TISER raises Complex-TR LLM-J from 89.6 to 92.0).
  • Implications for agent systems: Episodic memory is foundational for agent personalization and continual learning; REMem provides a practical engineering solution.

Rating

  • Novelty: ⭐⭐⭐⭐ — The hybrid memory graph design is novel and the tool-augmented reasoning approach is conceptually coherent.
  • Practicality: ⭐⭐⭐⭐⭐ — Directly applicable to long-term memory augmentation in conversational agents.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four benchmarks, comprehensive comparisons, ablation analysis, human evaluation, and error analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Structure is clear and tables are information-dense, though some descriptions are overly verbose.
  • Overall: ⭐⭐⭐⭐ — Establishes a strong baseline in the episodic memory domain; engineering contributions outweigh theoretical ones.