StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall

Conference: ACL 2026
arXiv: 2604.26243
Code: https://github.com/seucoin/StratMem-Bench.git
Area: LLM Evaluation / Virtual Character Conversation / Long-term Memory
Keywords: strategic memory use, virtual characters, long-term conversation, memory selection, LLM evaluation

TL;DR

StratMem-Bench categorizes memories in virtual character conversations into three types: must, nice, and irr. It evaluates whether models can actively incorporate beneficial memories and suppress irrelevant ones while meeting factual requirements. Results show that even strong current LLMs are unreliable at "supportive memory selection": SMC drops from 76-92% on must-only samples to roughly 39-58% once nice memories are involved.

Background & Motivation

Background: Long-term conversation and virtual character systems typically equip models with external memory, allowing characters to remember past experiences, user preferences, and personas. Most existing benchmarks assess factual recall, i.e., whether relevant facts are retrieved and reflected in the response.

Limitations of Prior Work: Human memory usage in conversation is not a case of "the more, the better." Some memories are mandatory for answering a question, some merely make the response more natural, empathetic, or personalized, and others, though present in the memory bank, should not be mentioned in the current scenario. If benchmarks only focus on fact recall, they fail to measure this selection capability.

Key Challenge: Virtual characters must be proactive and thoughtful like real humans without forcing irrelevant personal matters into the conversation. Models need a dynamic balance between proactivity and risk aversion.

Goal: Construct an evaluation set that explicitly distinguishes between required, supportive, and irrelevant memories, and design metrics to measure whether a model "uses all that is required, uses helpful ones moderately, and avoids what is not needed."

Key Insight: Drawing from Gricean Maxims, the paper maps "must" memories to factual correctness, "nice" memories to appropriate information quantity and social coherence, and "irr" memories to violations of relevance.

Core Idea: Reformulate memory from a static fact repository into dynamic conversational resources, evaluating the model's functional judgment of each memory under the current query, persona, and history.

Method

The task in StratMem-Bench is conditional response generation. Each sample contains conversation history, a current user query, a character persona, and an unlabeled memory pool. The model does not see the must, nice, or irr labels and must judge for itself which memories contribute functionally to the current response.

Overall Architecture

Data is derived from the LoCoMo multi-session virtual character dialogue dataset. The authors extract memory pools and personas from earlier sessions and use subsequent sessions as current conversation history to generate a current user query. This ensures that memories originate from the past and responses occur later, preventing temporal leakage.

Each memory is labeled into three categories based on its functional role for the current query. Must is memory essential for response correctness; nice is not required for correctness but enhances personalization, empathy, or social coherence; irr (irrelevant) is unrelated to the current dialogue goal and would be off-topic or abrupt if used.
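
A sample with its hidden role labels can be pictured as the small data model below. This is a minimal sketch, not the benchmark's actual schema: the class names, field names, the `model_view` helper, and the example memories are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    MUST = "must"  # essential for a correct response
    NICE = "nice"  # optional; adds personalization, empathy, or social coherence
    IRR = "irr"    # off-topic for the current query


@dataclass(frozen=True)
class Memory:
    memory_id: str
    text: str
    role: Role  # gold label, hidden from the model under test


@dataclass(frozen=True)
class Sample:
    persona: str
    history: list[str]
    query: str
    memories: list[Memory]

    def model_view(self) -> dict:
        """Return only what the evaluated model may see (no role labels)."""
        return {
            "persona": self.persona,
            "history": list(self.history),
            "query": self.query,
            "memory_pool": [m.text for m in self.memories],
        }


# Hypothetical sample; the texts are invented for illustration.
sample = Sample(
    persona="A friendly barista who recently moved to a new city.",
    history=["User: Long time no see!", "Character: It really has been a while."],
    query="Where should I send your birthday card?",
    memories=[
        Memory("m1", "Moved to a new city last month.", Role.MUST),
        Memory("m2", "Loves receiving handwritten cards.", Role.NICE),
        Memory("m3", "Prefers jazz over pop music.", Role.IRR),
    ],
)
view = sample.model_view()
```

Keeping the role on the memory record but out of `model_view` mirrors the setup described above: the labels exist only for scoring.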

The labeling process begins with initial tags generated by GPT-5.1, followed by review and discussion by multiple human experts. The authors report a Fleiss' kappa of 0.81 among three annotators before expert discussion, indicating strong agreement even though the category boundaries involve some subjective judgment.

Key Designs

  1. Instance-level Labeling of Three Memory Roles:

    • Function: Allows the same memory to hold different roles under different queries.
    • Mechanism: Labels are determined by the memory's functional contribution to the current dialogue goal rather than keyword overlap. For example, "moving to a new city" is a "must" when asking for an address, a "nice" when asking about recent life, and "irr" when asking about music preferences.
    • Design Motivation: Memory usage for virtual characters is essentially a contextual decision. Categorizing memory as fixedly relevant or irrelevant ignores changes in dialogue goals.
  2. Strict Memory Compliance (SMC):

    • Function: Uses hard constraints to measure if the model satisfies basic requirements of strategic memory use.
    • Mechanism: "must-only" samples require all musts to be used and irr to be excluded; "nice-only" requires at least one nice and no irr; "must+nice" requires all musts, at least one nice, and no irr.
    • Design Motivation: Traditional average quality scores can mask severe errors. SMC converts strategic selection into a pass/fail metric to identify bottlenecks in memory selection.
  3. MIQ, PES, and CIR Behavioral Profiling:

    • Function: Decouples "choosing the right memory" from "using the memory well."
    • Mechanism: MIQ (Memory Integration Quality) uses a 1-5 scale to evaluate if selected memories blend naturally; PES (Proactive Enrichment Score) measures proactivity in using nice memories when available; CIR (Irrelevant Citation Rate) measures the ratio of irr memories mistakenly used when nice memories are present.
    • Design Motivation: Some models are conservative (low CIR but insufficiently rich responses); others are proactive (high PES but prone to introducing irrelevant memory). Analyzing both is necessary to describe true behavior.
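
The SMC pass/fail rule described in item 2 can be sketched as a single function. This is an illustrative reading of the three scenario rules, not the authors' code; the function name, the id-based representation, and the example ids are assumptions.

```python
def smc_pass(roles: dict[str, str], used: set[str]) -> bool:
    """Hard-constraint check in the spirit of SMC.

    roles: memory id -> "must" | "nice" | "irr" (gold labels).
    used:  ids of memories the judge detected in the response.
    """
    must = {m for m, r in roles.items() if r == "must"}
    nice = {m for m, r in roles.items() if r == "nice"}
    irr = {m for m, r in roles.items() if r == "irr"}

    if used & irr:                  # any irrelevant memory fails the sample
        return False
    if not must <= used:            # every must memory has to appear
        return False
    if nice and not (used & nice):  # if nice memories exist, use at least one
        return False
    return True


# Hypothetical must+nice sample.
roles = {"m1": "must", "m2": "nice", "m3": "irr"}
results = {
    "all good": smc_pass(roles, {"m1", "m2"}),
    "no nice": smc_pass(roles, {"m1"}),
    "irr intrusion": smc_pass(roles, {"m1", "m2", "m3"}),
    "missing must": smc_pass(roles, {"m2"}),
}
```

Because the scenario is implied by which label sets are non-empty, the same function covers must-only, nice-only, and must+nice samples without a separate scenario argument.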

Loss & Training

As this is a benchmark paper, no new models were trained. During evaluation, all models use a uniform instruction template, inputting the persona, history, query, and unlabeled memory pool to generate a single-turn response in a zero-shot manner. Samples with excessive nice memories are downsampled to two nice memories to reduce the probability of hitting a nice memory by chance.
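
The downsampling step and the uniform prompt can be sketched as follows. This is a hedged illustration only: the summary does not say how the two nice memories are chosen, so the seeded random choice is an assumption, and the template wording, function names, and example memories are hypothetical.

```python
import random


def downsample_nice(memories, max_nice=2, seed=0):
    """Keep all must/irr memories and at most `max_nice` nice ones.

    memories: list of (text, role) tuples; original order is preserved.
    Random selection is an assumption, not the paper's stated procedure.
    """
    rng = random.Random(seed)
    nice_idx = [i for i, (_, role) in enumerate(memories) if role == "nice"]
    if len(nice_idx) <= max_nice:
        return list(memories)
    keep = set(rng.sample(nice_idx, max_nice))
    return [m for i, m in enumerate(memories) if m[1] != "nice" or i in keep]


def build_prompt(persona, history, query, memories):
    """Hypothetical zero-shot template; roles are dropped, only text is shown."""
    pool = "\n".join(f"- {text}" for text, _ in memories)
    turns = "\n".join(history)
    return (
        f"Persona:\n{persona}\n\n"
        f"Conversation history:\n{turns}\n\n"
        f"Memory pool:\n{pool}\n\n"
        f"User query:\n{query}\n\n"
        "Reply in character with a single response."
    )


memories = [
    ("Moved to a new city last month.", "must"),
    ("Adopted a cat named Miso.", "nice"),
    ("Started learning pottery.", "nice"),
    ("Enjoys weekend hikes.", "nice"),
    ("Prefers jazz over pop music.", "irr"),
]
reduced = downsample_nice(memories)
prompt = build_prompt(
    "A friendly barista.",
    ["User: Hi again!", "Character: Welcome back!"],
    "Where should I send your birthday card?",
    reduced,
)
```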

Automated evaluation is performed by DeepSeek-V3.2. Memory usage detection requires the evaluator to cite specific evidence from the response and utilizes majority voting across three samples; it achieved a Cohen's kappa of 0.96 with human experts on 1,130 memory-response pairs. MIQ achieved a Cohen's kappa of 0.69 against 300 human-annotated responses.
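
The two steps in this validation, majority voting over three judge samples and agreement with human labels, can be sketched in plain Python. The data below is invented for illustration and the function names are assumptions; only the majority-vote rule and the standard Cohen's kappa formula are taken as given.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label; unambiguous for three binary votes."""
    return Counter(votes).most_common(1)[0][0]


def cohens_kappa(a, b):
    """Cohen's kappa for two raters over the same items.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Hypothetical "was this memory used?" votes: three judge samples per pair.
judge_runs = [
    [True, True, False],
    [False, False, False],
    [True, False, True],
    [False, True, False],
]
judge = [majority_vote(v) for v in judge_runs]
human = [True, False, False, False]
kappa = cohens_kappa(judge, human)
```

Here the judge and the human agree on three of four pairs (observed agreement 0.75) against a chance agreement of 0.5, which gives a kappa of 0.5.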

Key Experimental Results

Main Results

The dataset contains 657 samples. The "must+nice" scenario constitutes the majority and represents the difficult setting closest to real character dialogue.

| Scenario | Samples | Avg. Memory Count | Avg. Words per Memory | Evaluation Meaning |
|---|---|---|---|---|
| must-only | 50 | 6.24 | 9.53 | Satisfy necessary facts and suppress irrelevant ones |
| nice-only | 132 | 9.12 | 10.09 | No hard factual needs; tests proactive enrichment |
| must+nice | 475 | 8.97 | 9.75 | Satisfy facts, enrichment, and suppression simultaneously |
| Overall | 657 | 8.79 | 9.81 | Full-scale evaluation |

| Model | SMC must-only (%) | SMC nice-only (%) | SMC must+nice (%) | SMC All (%) | MIQ All (on pass) |
|---|---|---|---|---|---|
| GPT-5.2 | 88.00 | 57.58 | 41.89 | 48.55 | 4.45 |
| GPT-5-chat | 90.00 | 46.21 | 41.68 | 46.27 | 4.56 |
| Claude Sonnet 4.5 | 90.00 | 53.03 | 46.95 | 51.45 | 4.37 |
| Gemini 3 Pro | 78.00 | 49.24 | 48.21 | 50.68 | 4.21 |
| DeepSeek-reasoner | 76.00 | 48.48 | 39.16 | 43.84 | 4.12 |
| Qwen3-235B | 92.45 | 46.56 | 42.28 | 47.18 | 4.24 |

Powerful models generally achieve 76%-92% SMC on "must-only" but show a significant decline on "nice-only" and "must+nice." This indicates that models can handle "mandatory facts" but struggle to judge "supportive memories that could enhance the dialogue."
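
As a consistency check on the table, the "SMC All" column can be compared against a sample-weighted mean of the three scenario scores. The aggregation rule is an assumption (the summary does not state how "All" is computed); the counts and scores are copied from the tables above, and the function name is hypothetical.

```python
# Sample counts per scenario, from the dataset statistics table.
counts = {"must-only": 50, "nice-only": 132, "must+nice": 475}

# Per-scenario SMC (%) for two rows of the results table.
smc = {
    "GPT-5.2": {"must-only": 88.00, "nice-only": 57.58, "must+nice": 41.89},
    "Claude Sonnet 4.5": {"must-only": 90.00, "nice-only": 53.03, "must+nice": 46.95},
}


def weighted_smc(per_scenario):
    """Sample-weighted mean of scenario-level SMC (assumed aggregation)."""
    total = sum(counts.values())
    return sum(per_scenario[s] * counts[s] for s in counts) / total


overall = {model: round(weighted_smc(scores), 2) for model, scores in smc.items()}
```

Under this assumption the two rows reproduce the reported 48.55 and 51.45, and the weights make clear why "SMC All" tracks the must+nice column: that scenario accounts for 475 of 657 samples. A row whose percentages imply different denominators (92.45% is not achievable with exactly 50 samples, for instance) would not be expected to reproduce exactly with these counts.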

Ablation Study

This paper does not include model architecture ablations. Instead, it decomposes behavior along several dimensions, which serves as a diagnostic breakdown of the evaluation.

| Model | must-used MIQ | nice-used MIQ | irr-used MIQ | PES All | CIR All |
|---|---|---|---|---|---|
| GPT-5.2 | 4.48 | 4.22 | 2.99 | 56.01 | 13.01 |
| GPT-5-chat | 4.55 | 4.38 | 2.81 | 51.91 | 7.91 |
| Claude Sonnet 4.5 | 4.36 | 4.18 | 3.05 | 62.48 | 15.82 |
| Gemini 3 Pro | 3.92 | 3.73 | 2.63 | 73.33 | 31.96 |
| DeepSeek-chat | 4.32 | 4.12 | 2.75 | 56.96 | 15.49 |
| Qwen3-Max | 4.14 | 4.04 | 2.64 | 57.76 | 19.77 |

Key Findings

  • Once a "must" memory is correctly selected, MIQ is usually high, indicating the current bottleneck is "which memories to select" rather than linguistic expression.
  • "Nice" memories incur an enrichment tax: nice-used MIQ is typically about 0.2 lower than must-used MIQ, suggesting that extra personalized information is prone to being integrated awkwardly or tenuously.
  • "Irr" memories cause a collapse in quality; irr-used MIQ scores mostly fall between 2.6-3.1, showing that irrelevant memories not only deviate from the topic but also damage character response coherence.
  • A clear trade-off exists between PES and CIR. GPT-5-chat has the lowest CIR (7.91) but also the lowest PES (51.91); Gemini 3 Pro has the highest PES (73.33) but its CIR reaches 31.96, showing that proactivity carries a higher risk of irrelevant memory intrusion.
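
The PES/CIR trade-off in the last finding can be quantified with a rank correlation over the six rows of the behavioral table. This is a rough sketch, not an analysis from the paper: the helper names are assumptions, and with only six models the number is descriptive rather than a significance claim.

```python
# PES All and CIR All per model, copied from the behavioral table.
pes = {"GPT-5.2": 56.01, "GPT-5-chat": 51.91, "Claude Sonnet 4.5": 62.48,
       "Gemini 3 Pro": 73.33, "DeepSeek-chat": 56.96, "Qwen3-Max": 57.76}
cir = {"GPT-5.2": 13.01, "GPT-5-chat": 7.91, "Claude Sonnet 4.5": 15.82,
       "Gemini 3 Pro": 31.96, "DeepSeek-chat": 15.49, "Qwen3-Max": 19.77}


def ranks(values):
    """Rank values from 1 (largest) to n (smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


models = list(pes)
rho = spearman([pes[m] for m in models], [cir[m] for m in models])
```

The two rankings differ only in the order of Claude Sonnet 4.5 and Qwen3-Max, so the correlation comes out at about 0.94: models that enrich more also intrude more.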

Highlights & Insights

  • This paper advances long-term memory evaluation from "whether it can remember" to "whether it should speak." For real character and personal assistant systems, this is closer to the actual product risk than simple recall is.
  • The division into must, nice, and irr is highly practical. It acknowledges that dialogue quality involves not just factual correctness, but also moderate personalization and relevance control.
  • SMC is a hard metric, MIQ is a quality metric, and PES/CIR are behavioral tendency metrics. Combining the four allows model failures to be decomposed into different modes such as "missing mandatory memory," "unable to proactively enrich," "misusing irrelevant memory," or "poor integration quality."
  • The study finds that high-performing models maintain high MIQ when they pass SMC, suggesting that improving strategic memory use may require better memory selection or policies rather than simply stronger generators.

Limitations & Future Work

  • The evaluation only covers single-turn response generation and does not examine how memory selection strategies change dynamically over multi-turn interactions.
  • It currently handles only textual memories, without covering multimodal character memories (voice, image, appearance, location, etc.).
  • Data is derived from LoCoMo conversion and synthetic pipelines; while human-reviewed, privacy, emotion, and social boundaries in real long-term user interactions are more complex.
  • The automated evaluation relies on an LLM judge. Although it shows high consistency with humans, it may still exhibit preferences for certain expression styles or model families.

Related Work & Insights

  • vs LoCoMo / LongMemEval: These benchmarks primarily assess long-range memory retrieval and factual recall; StratMem-Bench further requires models to judge the conversational functional role of memories.
  • vs Personalized RAG: Personalized RAG focuses on introducing user preferences to improve response relevance; this paper emphasizes dynamic selection of must, nice, and irr during generation and evaluates the risks of over-personalization.
  • vs Character Consistency Evaluation: Traditional role-playing evaluations look at persona consistency; this paper examines whether characters "know when to mention the past" like a human.
  • Insight: For real-world assistant systems, one could explicitly model must/nice/irr or similar labels in a memory manager and use CIR as a safety and user experience metric.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Shifts from factual recall to strategic memory use; the problem definition is clear, and the metric combination is realistic.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ Covers multiple strong models with human consistency validation, but lacks multi-turn and real-world user scenario evaluation.
  • Writing Quality: ⭐⭐⭐⭐☆ Tasks, metrics, and conclusions are clearly organized; some automated evaluation details are in the appendix, making the main text a smooth read.
  • Value: ⭐⭐⭐⭐⭐ Highly relevant for virtual characters, personal assistants, and long-term memory RAG, especially for diagnosing "too cold" vs. "too talkative" failures.