StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall

Conference: ACL 2026
arXiv: 2604.26243
Code: https://github.com/seucoin/StratMem-Bench.git
Area: LLM Evaluation / Virtual Character Conversation / Long-term Memory
Keywords: strategic memory use, virtual characters, long-term conversation, memory selection, LLM evaluation

TL;DR

StratMem-Bench categorizes memories in virtual character conversations into three types: must, nice, and irr. It evaluates whether models meet factual requirements while actively incorporating beneficial memories and suppressing irrelevant ones. The results show that current strong LLMs remain markedly unstable at "supportive memory selection."

Background & Motivation

Background: Long-term dialogue and virtual character systems are typically equipped with external memory to allow characters to remember past experiences, user preferences, and personas. Most existing benchmarks evaluate factual recall, measuring whether relevant facts are retrieved and reflected in the response.

Limitations of Prior Work: Human conversation aims for optimal memory use, not maximal memory use. Some memories are mandatory for answering a question, some merely make the response more natural, empathetic, or personalized, and others, though present in the memory bank, should not be mentioned in the current context. Benchmarks that focus solely on factual recall cannot measure this selection capability.

Key Challenge: Virtual characters must be proactive and caring like real humans without forcing irrelevant personal matters into the conversation. Models need to dynamically balance proactivity and risk aversion.

Goal: Construct an evaluation set that explicitly distinguishes between required, supportive, and irrelevant memories, and design metrics to measure whether a model uses "all that must be used, the appropriate amount of what can be used, and none of what should not be used."

Key Insight: Drawing on the Gricean maxims, must memories correspond to factual correctness, nice memories correspond to quantity of information and social coherence, and irr memories correspond to violations of relevance.

Core Idea: Transform memory from a static warehouse of facts into a dynamic resource within a conversation, evaluating the model's functional judgment of each memory piece given the current query, persona, and history.

Method

Overall Architecture

StratMem-Bench models strategic memory use as a conditional response generation task. Each sample provides the model with a dialogue history, a current user query, a character persona, and an unlabeled pool of memories. The model must independently judge which memories contribute functionally to the response. Data is derived from LoCoMo multi-session conversations. Memories and personas are extracted from early sessions, while subsequent sessions serve as dialogue history to generate the current query, thereby avoiding temporal leakage.
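To make the task input concrete, here is a minimal Python sketch of one sample and a prompt renderer. The class names, field names, and prompt layout are illustrative assumptions, not the benchmark's actual schema or instruction template; the only property taken from the description above is that the memory pool is passed to the model without role labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Memory:
    memory_id: str  # hypothetical identifier, used only to list the pool
    text: str


@dataclass
class StratMemSample:
    persona: str
    dialogue_history: List[str]
    query: str
    memory_pool: List[Memory] = field(default_factory=list)


def build_prompt(sample: StratMemSample) -> str:
    """Render a sample as a single prompt. Memories appear without any
    must/nice/irr label, so the model has to judge their role itself."""
    memory_lines = "\n".join(
        f"- [{m.memory_id}] {m.text}" for m in sample.memory_pool
    )
    history = "\n".join(sample.dialogue_history)
    return (
        f"Persona:\n{sample.persona}\n\n"
        f"Memories:\n{memory_lines}\n\n"
        f"Dialogue history:\n{history}\n\n"
        f"User: {sample.query}\nCharacter:"
    )
```

The gold role labels would live outside this structure, on the evaluation side only, so that nothing in the prompt reveals which memories are required, supportive, or irrelevant.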

Key Designs

1. Instance-level labeling of three memory roles: allowing memory roles to change with the query. Fixing a memory as "relevant" or "irrelevant" once and for all ignores the shifting goals of dialogue. StratMem-Bench instead labels each memory by its functional contribution to the current dialogue goal: must is required for factual correctness; nice is not needed for correctness but enhances personalization or empathy; irr is irrelevant or obtrusive. Initial labels are generated by GPT-5.1 and verified by human experts (Fleiss' kappa = 0.81).

2. Strict Memory Compliance (SMC): reducing strategic memory use to pass/fail. Average quality scores often dilute serious errors such as missing a mandatory memory. SMC applies hard rules instead: must-only requires that all must memories are used and no irr memory is used; nice-only requires at least one nice memory and no irr; must+nice requires all must memories, at least one nice memory, and no irr. This binary verdict exposes hard failures in strategic selection.

3. MIQ, PES, and CIR behavioral profiling: separating "selection" from "integration." A single score cannot explain why a model fails (for example, being too conservative versus being too aggressive). Three metrics are used: MIQ (1–5 scale) evaluates how naturally the selected memories are integrated; PES measures proactivity by checking whether the model enriches its response when nice memories are available; CIR measures risk aversion through the rate at which irr memories are misused.
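The SMC rules in item 2 and the two behavioral rates in item 3 can be sketched in a few lines of Python. The SMC function follows the three rules as stated above. The PES and CIR functions are an assumed sample-level reading (the share of eligible samples in which at least one nice, respectively irr, memory is used); the paper's exact formulas may differ. All function and variable names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Roles = Dict[str, str]          # memory_id -> "must" | "nice" | "irr"
Scored = Tuple[Roles, Set[str]]  # (gold roles, ids detected as used)


def ids_with_role(roles: Roles, role: str) -> Set[str]:
    return {m for m, r in roles.items() if r == role}


def smc_pass(scenario: str, roles: Roles, used: Set[str]) -> bool:
    """Binary Strict Memory Compliance verdict for one response."""
    must = ids_with_role(roles, "must")
    nice = ids_with_role(roles, "nice")
    irr = ids_with_role(roles, "irr")
    if used & irr:                      # any irr memory is a hard failure
        return False
    if scenario == "must-only":
        return must <= used             # every must memory is used
    if scenario == "nice-only":
        return bool(used & nice)        # at least one nice memory is used
    if scenario == "must+nice":
        return must <= used and bool(used & nice)
    raise ValueError(f"unknown scenario: {scenario}")


def rate_using(samples: List[Scored], role: str) -> float:
    """Percent of samples offering `role` memories that use at least one."""
    eligible = [(r, u) for r, u in samples if role in r.values()]
    if not eligible:
        return float("nan")
    hits = sum(1 for r, u in eligible if u & ids_with_role(r, role))
    return 100.0 * hits / len(eligible)


def pes(samples: List[Scored]) -> float:
    return rate_using(samples, "nice")   # assumed proactivity reading


def cir(samples: List[Scored]) -> float:
    return rate_using(samples, "irr")    # assumed irr-misuse reading
```

Under this reading a model can raise PES simply by mentioning more memories, but every irr memory it drags in raises CIR and fails SMC outright, which is why the metrics are reported together.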

This is a benchmark paper; no new models are trained. All models are evaluated with a unified instruction template, and automatic evaluation is conducted by DeepSeek-V3.2. Memory usage detection requires the judge to cite evidence and is decided by majority vote. Agreement with human annotators is Cohen's kappa = 0.96 for usage detection and 0.69 for MIQ.
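One way to picture "evidence citation plus majority voting" is the sketch below. It assumes several independent judge runs per memory, counts a "used" vote only when the quoted evidence actually appears in the response, and requires a strict majority. The number of runs, the verbatim-substring check, and all names are assumptions made for illustration; the source does not describe the procedure at this level of detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgeVote:
    used: bool      # the judge's verdict for one memory
    evidence: str   # span the judge quotes from the response ("" if none)


def memory_used(votes: List[JudgeVote], response: str) -> bool:
    """A memory counts as used if a strict majority of judge runs say so
    AND back the verdict with a quote that really occurs in the response."""
    valid = sum(
        1 for v in votes
        if v.used and v.evidence and v.evidence in response
    )
    return valid * 2 > len(votes)


response = "I remember you adopted a dog last spring. How is she doing?"
votes = [
    JudgeVote(True, "adopted a dog"),
    JudgeVote(True, "adopted a dog last spring"),
    JudgeVote(False, ""),
]
print(memory_used(votes, response))
```

Discarding votes whose evidence cannot be found in the response is a cheap guard against a judge claiming usage without support.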

Key Experimental Results

Main Results

The dataset contains 657 samples. The must+nice setting is the most common (475 samples) and, judging by the SMC scores below, also the most difficult.

Scenario	Samples	Avg Memories	Avg Tokens/Mem	Eval Meaning
must-only	50	6.24	9.53	Satisfy necessary facts & suppress irr
nice-only	132	9.12	10.09	No hard facts; test proactive enrichment
must+nice	475	8.97	9.75	Satisfy facts, enrichment, and suppression
Overall	657	8.79	9.81	Full evaluation

Model	SMC must-only	SMC nice-only	SMC must+nice	SMC All	MIQ All on pass
GPT-5.2	88.00	57.58	41.89	48.55	4.45
GPT-5-chat	90.00	46.21	41.68	46.27	4.56
Claude Sonnet 4.5	90.00	53.03	46.95	51.45	4.37
Gemini 3 Pro	78.00	49.24	48.21	50.68	4.21
DeepSeek-reasoner	76.00	48.48	39.16	43.84	4.12
Qwen3-235B	92.45	46.56	42.28	47.18	4.24

Strong models perform well on must-only (76%-92% SMC) but degrade sharply on nice-only (about 46%-58%) and must+nice (about 39%-48%). Models handle "mandatory facts" well but struggle to judge "supportive memories."
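As a quick consistency check on the table, the sketch below assumes that "SMC All" is the sample-weighted mean of the three scenario scores, using the sample counts from the dataset table. The source does not state this definition explicitly, so treat it as an assumption; it does reproduce the reported overall values for the two rows shown here.

```python
from typing import Dict

# Sample counts per scenario, from the dataset table above.
COUNTS: Dict[str, int] = {"must-only": 50, "nice-only": 132, "must+nice": 475}


def overall_smc(per_scenario: Dict[str, float]) -> float:
    """Sample-weighted mean of per-scenario SMC (assumed definition)."""
    total = sum(COUNTS.values())
    return sum(per_scenario[s] * n for s, n in COUNTS.items()) / total


gpt_5_2 = {"must-only": 88.00, "nice-only": 57.58, "must+nice": 41.89}
claude_sonnet_4_5 = {"must-only": 90.00, "nice-only": 53.03, "must+nice": 46.95}

print(round(overall_smc(gpt_5_2), 2))            # 48.55, as reported
print(round(overall_smc(claude_sonnet_4_5), 2))  # 51.45, as reported
```

Because must+nice accounts for 475 of the 657 samples, the overall score is dominated by that scenario, which is why a high must-only score barely moves "SMC All".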

Ablation Study

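Rather than ablating components, this analysis is diagnostic: it breaks integration quality down by the role of the memory that was used (must, nice, or irr) and reports each model's proactivity (PES) and irr-misuse rate (CIR).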

Model	must-used MIQ	nice-used MIQ	irr-used MIQ	PES All	CIR All
GPT-5.2	4.48	4.22	2.99	56.01	13.01
GPT-5-chat	4.55	4.38	2.81	51.91	7.91
Claude Sonnet 4.5	4.36	4.18	3.05	62.48	15.82
Gemini 3 Pro	3.92	3.73	2.63	73.33	31.96
DeepSeek-chat	4.32	4.12	2.75	56.96	15.49
Qwen3-Max	4.14	4.04	2.64	57.76	19.77

Key Findings

Once must memories are selected, MIQ is high (3.92 to 4.55), suggesting the bottleneck is memory selection rather than linguistic expression.
Nice memories incur an "enrichment tax": nice-used MIQ is roughly 0.1 to 0.3 lower than must-used MIQ for every model, suggesting personalized information is harder to integrate naturally.
Irr memories cause quality collapse (MIQ 2.6-3.1), indicating that irrelevant information disrupts coherence rather than merely being off-topic.
A clear trade-off exists between PES and CIR: Gemini 3 Pro has the highest proactivity (PES 73.33) and also the highest irr-misuse rate (CIR 31.96), while GPT-5-chat is lowest on both (51.91 and 7.91).
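The enrichment gap and the PES/CIR trade-off can be read directly off the table above. The sketch below recomputes both from the reported numbers; the Pearson coefficient is a descriptive addition of this note, not a statistic reported in the source, and with only six models it should be read as an illustration rather than a significance claim.

```python
import math
from typing import Dict, List, Tuple

# (must-used MIQ, nice-used MIQ, PES All, CIR All), copied from the table.
ROWS: Dict[str, Tuple[float, float, float, float]] = {
    "GPT-5.2":           (4.48, 4.22, 56.01, 13.01),
    "GPT-5-chat":        (4.55, 4.38, 51.91, 7.91),
    "Claude Sonnet 4.5": (4.36, 4.18, 62.48, 15.82),
    "Gemini 3 Pro":      (3.92, 3.73, 73.33, 31.96),
    "DeepSeek-chat":     (4.32, 4.12, 56.96, 15.49),
    "Qwen3-Max":         (4.14, 4.04, 57.76, 19.77),
}


def pearson(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# "Enrichment tax": must-used MIQ minus nice-used MIQ, per model.
gaps = {model: round(r[0] - r[1], 2) for model, r in ROWS.items()}

# Proactivity versus irr misuse across the six models.
r_pes_cir = pearson([r[2] for r in ROWS.values()],
                    [r[3] for r in ROWS.values()])

for model, gap in gaps.items():
    print(f"{model}: MIQ gap {gap}")
print(f"Pearson r(PES, CIR) = {r_pes_cir:.2f}")
```

A strongly positive coefficient here is what the trade-off looks like numerically: the models that volunteer nice memories more often are also the ones that let irr memories through more often.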

Highlights & Insights

Advances long-term memory evaluation from "can it remember" to "should it speak," which is closer to the risks that deployed assistants actually face.
The must/nice/irr categorization acknowledges that dialogue quality involves both factual correctness and appropriate personalization.
The combination of SMC (hard metric), MIQ (quality), and PES/CIR (behavioral tendency) allows for detailed failure diagnostics.

Limitations & Future Work

Focuses on single-turn response generation, omitting dynamic strategy shifts in multi-turn interactions.
Limited to text-based memory, excluding multimodal aspects like voice, appearance, or location.
Synthetic data may not capture the full complexity of privacy, emotion, and social boundaries in real-world interactions.
Reliance on LLM judges may introduce biases toward specific stylistic patterns.

Related Work & Insights

vs. LoCoMo / LongMemEval: Moves beyond factual recall to functional dialogue roles.
vs. Personalized RAG: Emphasizes dynamic selection and the risks of over-personalization.
vs. Persona Consistency: Shifts focus from identity consistency to the human-like timing of memory usage.
Insight: For assistant systems, one could explicitly model must/nice/irr tags in the memory manager and use CIR as a safety/UX metric.
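As a rough illustration of that last insight, a memory manager could filter retrieved memories by a predicted role before they reach the prompt. The sketch below is hypothetical throughout: it assumes some upstream classifier has already assigned a must/nice/irr role to each retrieved memory, and the cap on nice memories is an arbitrary design choice, not something proposed in the paper.

```python
from typing import Dict, List


def select_for_prompt(predicted_roles: Dict[str, str], max_nice: int = 2) -> List[str]:
    """Keep every predicted-must memory, at most `max_nice` predicted-nice
    memories (in retrieval order), and drop everything predicted irr."""
    must = [m for m, r in predicted_roles.items() if r == "must"]
    nice = [m for m, r in predicted_roles.items() if r == "nice"][:max_nice]
    return must + nice


# Retrieval order is preserved because dicts keep insertion order.
retrieved = {"a": "must", "b": "nice", "c": "irr", "d": "nice", "e": "nice"}
print(select_for_prompt(retrieved))  # ['a', 'b', 'd']
```

Logging how often a predicted-irr memory still surfaces in responses would give a CIR-style rate that can be tracked as a safety or UX metric over time.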

Rating

Novelty: ⭐⭐⭐⭐☆
Experimental Thoroughness: ⭐⭐⭐⭐☆
Writing Quality: ⭐⭐⭐⭐☆
Value: ⭐⭐⭐⭐⭐