From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions¶

Conference: ACL 2025
arXiv: 2502.13791
Code: https://github.com/Cohere-Labs-Community/MemoryCode
Area: LLM Evaluation
Keywords: multi-session dialogue, long-term memory, coding instructions, prospective memory, LLM evaluation

TL;DR¶

Proposes MemoryCode, a synthetic multi-session dataset for evaluating whether LLMs can track and execute coding instructions over long-term interactions. Even GPT-4o's accuracy drops by about 67 percentage points (from roughly 95% to 28%) when the instructions are embedded in the full dialogue history rather than presented in isolation, which points to fundamental limitations of current LLMs in prospective memory and information integration.

Background & Motivation¶

Background: LLMs perform well on isolated, self-contained tasks and are widely used in work environments. Moving from "tools" to "teammates", however, requires retaining and using information across many interactions.

Limitations of Prior Work: Existing multi-turn/multi-session evaluations either focus on simple information retrieval (Needle-in-a-Haystack) or lack practical task requirements. Most prior work tests memory only when the model is explicitly asked to recall something, so it does not probe prospective memory (remembering, unprompted, to apply an earlier instruction).

Key Challenge: Can LLMs, like a new colleague, utilize information accumulated from daily work interactions for future tasks without being explicitly prompted to recall?

Goal: Evaluate the ability of LLMs to spontaneously retrieve and execute previously received coding guidelines during long-term, cross-session interactions.

Key Insight: Use a mentor-newbie onboarding scenario in which the instructions are simple (e.g., "function names must start with g_") but are scattered among a large amount of irrelevant information and can be updated over time.

Core Idea: LLMs follow the instructions well when these are presented in isolation, but their accuracy plunges when the same instructions are embedded in realistic multi-session dialogue histories. This indicates that the issue lies in retrieval and integration rather than in execution capability.

Method¶

Overall Architecture¶

Manually design seeds (instructions, distractors, personas, company names) -> Templatized recombination -> LLM-generated multi-session dialogue histories -> Evaluation at different levels: Instruction Only / Single Session / Full History.

Key Designs¶

Four Types of Seeds
- Instructions (51 rules): Simple coding conventions (e.g., "function names must start with g_"), with 16 rules that can be updated up to 8 times.
- Distractors (80 items): Company policies, non-code related instructions.
- Personas: 6 mentor types × 5 newbie types (30 possible pairings).
- Names: Randomly fictionalized company and person names.
- Design Motivation: The combination of seeds ensures data diversity.
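The seed pools and their recombination can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Instruction` class, the pool contents, and `sample_seed_bundle` are hypothetical names and toy data chosen for this example; only the pool sizes mentioned in the comments come from the description above.

```python
import random
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Instruction:
    topic: str       # what the rule governs (hypothetical label)
    versions: tuple  # successive wordings; a later version overrides earlier ones


# Toy seed pools (assumption). The dataset described above uses 51 instructions,
# 80 distractors, 6 mentor personas and 5 newbie personas.
INSTRUCTIONS = [
    Instruction("function_name_prefix",
                ("start function names with g_", "start function names with x_")),
    Instruction("docstrings", ("always add docstrings to functions",)),
]
DISTRACTORS = ["submit timesheets by Friday", "book meeting rooms in advance"]
MENTOR_PERSONAS = ["patient", "strict"]
NEWBIE_PERSONAS = ["eager", "quiet"]


def persona_pairs(mentors, newbies):
    """Every mentor/newbie pairing (6 x 5 = 30 in the full seed set)."""
    return list(product(mentors, newbies))


def sample_seed_bundle(rng):
    """Pick one persona pairing, one instruction and one distractor at random."""
    mentor, newbie = rng.choice(persona_pairs(MENTOR_PERSONAS, NEWBIE_PERSONAS))
    return {
        "mentor": mentor,
        "newbie": newbie,
        "instruction": rng.choice(INSTRUCTIONS),
        "distractor": rng.choice(DISTRACTORS),
    }
```

Passing a seeded `random.Random` instance keeps the sampled combinations reproducible.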
Controllable Dialogue History Generation
- Parametric templates: number of sessions (1-100), ratio of sessions containing instructions (50-70%), number of instructions per session (1-3), update ratio, etc.
- Command R+ turns the filled templates into natural-sounding dialogues.
- Design Motivation: Precisely control complexity dimensions.
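To make the template parameters concrete, here is a rough sketch of how a per-history plan might be sampled before any dialogue text is generated. The function name, its arguments, and the sampling scheme are assumptions for illustration; the ranges (50-70% of sessions carrying instructions, 1-3 instructions per such session) are the ones listed above, and the update ratio is left out for brevity.

```python
import random


def sample_history_plan(num_sessions, rng,
                        instr_ratio_range=(0.5, 0.7),
                        max_instr_per_session=3):
    """Decide which sessions carry instructions and how many each one gets."""
    ratio = rng.uniform(*instr_ratio_range)
    # At least one session must carry an instruction, otherwise nothing is tested.
    n_with_instr = max(1, round(ratio * num_sessions))
    chosen = set(rng.sample(range(num_sessions), n_with_instr))

    plan = []
    for idx in range(num_sessions):
        n_instr = rng.randint(1, max_instr_per_session) if idx in chosen else 0
        plan.append({"session": idx + 1, "num_instructions": n_instr})
    return plan
```

Sessions with `num_instructions == 0` would contain only distractor content in this sketch.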
Three-Level Evaluation
- Level 1 — Instruction Only: Direct presentation of instructions + task to evaluate execution capability.
- Level 2 — Single Session: Presentation of a complete session containing instructions and distractors to evaluate intra-session retrieval.
- Level 3 — Full History: Presentation of concatenated sessions to evaluate cross-session retrieval and integration.
- Design Motivation: Progressively increase difficulty to dissociate "inability to execute" from "failure to retrieve".
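The three levels differ only in how much context surrounds the instruction. A minimal sketch of the prompt assembly, assuming sessions are plain strings and using a hypothetical `build_prompt` helper (the paper's actual prompt format is not reproduced here):

```python
def build_prompt(level, sessions, target_index, instructions, task):
    """Assemble the model input for one of the three evaluation levels."""
    if level == 1:
        # Instruction Only: just the rules, with no dialogue around them.
        context = "\n".join(instructions)
    elif level == 2:
        # Single Session: the one session that contains the rules.
        context = sessions[target_index]
    elif level == 3:
        # Full History: every session, concatenated in chronological order.
        context = "\n\n".join(sessions)
    else:
        raise ValueError(f"unknown level: {level}")
    return f"{context}\n\nTask: {task}"


# Toy data (assumption) to show how the context grows from level to level.
sessions = [
    "Session 1: Mentor: Remember to file expenses monthly.",
    "Session 2: Mentor: Always start function names with g_.",
    "Session 3: Mentor: The team lunch is on Thursday.",
]
instructions = ["Always start function names with g_."]
task = "Write a function that adds two numbers."
```

Note that the task text never mentions the naming rule: the model has to recall and apply it unprompted, which is what makes this a prospective-memory test.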
Test Function Design
- Use regular expressions to detect instruction adherence (e.g., whether function names start with g_).
- Do not evaluate code functional correctness, only adherence to code specifications.
- Design Motivation: Decouple retrieval capability from programming capability.
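A regex-based adherence check for the "function names must start with g_" example could look like the sketch below. It assumes the generated code is Python and treats output that defines no function at all as a failure; both choices are assumptions of this example rather than details confirmed by the source.

```python
import re

# Matches "def name(" or "async def name(" at the start of a (possibly indented) line.
FUNC_DEF = re.compile(r"^\s*(?:async\s+)?def\s+([A-Za-z_]\w*)\s*\(", re.MULTILINE)


def function_names(code):
    """Return the names of all functions defined in the given source text."""
    return FUNC_DEF.findall(code)


def follows_prefix_rule(code, prefix="g_"):
    """True only if at least one function is defined and all names carry the prefix."""
    names = function_names(code)
    return bool(names) and all(name.startswith(prefix) for name in names)
```

The check says nothing about whether the function body is correct, which matches the decision to score only adherence to the convention.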

Dataset Scale¶

360 dialogue histories, with 30 instances for each session count (i.e., 12 distinct session counts).
"Short" histories (< 15 sessions, 54%), "long" histories (> 15 sessions, 46%).
Maximum history length is 63K tokens.

Key Experimental Results¶

Main Results — Accuracy at Different Levels¶

Model	Level 1 (Instruction Only)	Level 2 (Single Session)	Level 3 (Full History)	Drop, Level 1 → Level 3 (percentage points)
GPT-4o	~95%	~85%	~28%	-67
DeepSeek-R1	~93%	~80%	~35%	-58
Claude-3.5	~92%	~82%	~30%	-62
Command R+	~88%	~75%	~25%	-63
Llama-3.1-70B	~85%	~65%	~20%	-65
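The figures in the Drop column equal the Level 1 accuracy minus the Level 3 accuracy, i.e. a difference in percentage points rather than a relative decrease. The short sketch below recomputes both quantities from the approximate values in the table; the `RESULTS` dictionary and `accuracy_drop` are names introduced only for this check.

```python
# Approximate (Level 1, Level 3) accuracies in percent, copied from the table.
RESULTS = {
    "GPT-4o": (95, 28),
    "DeepSeek-R1": (93, 35),
    "Claude-3.5": (92, 30),
    "Command R+": (88, 25),
    "Llama-3.1-70B": (85, 20),
}


def accuracy_drop(level1, level3):
    """Return the drop in percentage points and as a share of Level 1 accuracy."""
    points = level1 - level3
    relative = 100 * points / level1
    return points, relative


for model, (l1, l3) in RESULTS.items():
    points, relative = accuracy_drop(l1, l3)
    print(f"{model}: -{points} points ({relative:.1f}% relative)")
```

Because the Level 1 accuracies are below 100%, the relative decrease is somewhat larger than the percentage-point difference for every model.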

Impact of Session Count on Performance¶

Session Count	GPT-4o Accuracy	Small Model Accuracy
1-5	~70%	~50%
10-20	~35%	~20%
50-100	~15%	~5%

Ablation Study — Impact of Instruction Updates¶

Configuration	Accuracy	Description
No updates	Higher	Only requires retrieval
With updates (keep latest)	Significant drop	Requires integration and overriding old information
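What "keep latest" demands of a model can be stated as a simple reference procedure: walk through the sessions in order and let each new version of a rule overwrite the previous one. The sketch below assumes a hypothetical representation in which every session is a list of `(topic, rule)` pairs; it illustrates the target behaviour, not how any evaluated model works internally.

```python
def active_instructions(sessions):
    """Return the rule currently in force for each topic.

    sessions: chronologically ordered list; each session is a list of
    (topic, rule) pairs. Later pairs override earlier ones on the same topic.
    """
    active = {}
    for session in sessions:
        for topic, rule in session:
            active[topic] = rule  # the most recent version wins
    return active


# Toy history (assumption): the prefix rule is introduced, then updated later.
history = [
    [("function_name_prefix", "g_")],
    [],  # a session containing only distractors
    [("function_name_prefix", "x_"), ("docstrings", "required")],
]
```

With this toy history the prefix in force is `x_`, so code that still follows the original `g_` rule would count as a failure even though it matches an instruction the model really did receive.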

Key Findings¶

All models perform well at Level 1 (roughly 85-95%): The instructions themselves are simple, so execution capability is not the limiting factor.
Level 3 accuracy plunges: GPT-4o drops from roughly 95% to 28%, which indicates that the bottleneck is retrieval and integration, not execution.
Performance degrades with more sessions: At 100 sessions, almost all models fail.
Instruction updates present an additional challenge: Models must not only find the instruction but also identify the latest version.
DeepSeek-R1 (reasoning model) performs slightly better but remains poor: Reasoning capabilities cannot compensate for retrieval deficiencies.
Distracting information (especially instruction-like distractors) severely impacts performance.

Highlights & Insights¶

The three-level evaluation design cleanly separates "execution capability" from "retrieval capability": the cliff-like performance drop from Level 1 to Level 3 is strong evidence that the bottleneck is retrieval.
The onboarding mentor scenario closely mirrors real-world work environments, which is the typical use case for LLMs as coding assistants, so the identified limitations bear directly on product design.
Prospective memory (spontaneously recalling and applying past instructions without a prompt) is a fundamental human cooperative capability, yet the evaluated LLMs largely lack it.

Limitations & Future Work¶

Synthetic data may not fully reflect the complexity of real-world work scenarios.
Only coding conventions are evaluated; more complex knowledge (such as architectural decisions) is not tested.
Does not explore the mitigation effects of external memory mechanisms (e.g., RAG, memory modules).
Directions for improvement: evaluating LLMs with memory modules, adaptive information compression, and progressive instruction learning.

Comparison with Related Work¶

vs Needle-in-a-Haystack: NIAH explicitly prompts retrieval, whereas MemoryCode requires implicit/prospective memory.
vs MMMT-IF: MMMT-IF tests instruction following but does not involve multi-session updates.
vs LoCoMo: LoCoMo tests factual retrieval, while MemoryCode tests the application of instructions.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First benchmark to test multi-session prospective memory and instruction integration.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple models × multiple session lengths × three-level evaluation.
Writing Quality: ⭐⭐⭐⭐⭐ Scenario design is close to reality, with clear experimental logic.
Value: ⭐⭐⭐⭐⭐ Provides direct insights for product design of LLMs as collaborative tools.