Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Conference: ICML 2026
arXiv: 2602.06025
Code: https://github.com/ViktorAxelsen/BudgetMem
Area: LLM Agent / Long-term Memory / Runtime Compute Orchestration
Keywords: agent memory, runtime extraction, budget-tier routing, RL router, performance-cost trade-off

TL;DR

BudgetMem reformulates runtime agent memory extraction as a modular pipeline: filtering → parallel entity/temporal/topic extraction → summarization. Each module exposes LOW/MID/HIGH budget tiers behind the same interface. A shared lightweight router, trained via PPO, selects a tier for each module when a query arrives, improving F1/Judge scores over the baselines while lowering the average per-query cost on LoCoMo, LongMemEval, and HotpotQA.

Background & Motivation

Background: Currently, mainstream LLM agent memory systems primarily follow an "offline, query-agnostic" path. Chat histories are pre-compressed, summarized, and written into vector databases or knowledge graphs immediately upon generation. Approaches like MemoryBank, Mem0, and A-MEM completely decouple "indexing" from "memory usage," requiring only retrieval during QA.

Limitations of Prior Work: This "build once, use always" paradigm is decoupled from specific queries, which makes it both wasteful (pre-processing compute may be entirely irrelevant to the current query) and fragile (offline summarization/compression might discard details crucial for a specific query). A more natural alternative is "on-demand" extraction from raw history when a query arrives. However, this shifts expensive LLM calls to runtime, making cost and latency primary concerns. Existing runtime memory systems (e.g., ReadAgent, LightMem) lack explicit control knobs for the cost-performance trade-off.

Key Challenge: To controllably trade quality for cost at runtime, two conflated sub-problems must be addressed: where to allocate the budget (at which pipeline granularity) and how to implement the budget (e.g., reducing token usage can be achieved through different implementations, reasoning methods, or model sizes).

Goal: To construct a unified runtime memory framework where "budget units" are explicit at the module level, "budget implementation methods" can be compared side-by-side, and the overall trade-off can be learned rather than manually tuned.

Key Insight: By structuring memory extraction as a multi-stage modular pipeline and forcing each module to implement an identical "budget-tier interface" (providing three tiers under the same I/O contract), routing decisions are reduced to a small-scale sequential decision problem: "selecting one of three tiers for each module."

Core Idea: A shared small router uses the query and the intermediate state of the previous stage as its state. It uses PPO to learn query-aware module-level tier selection based on a "task reward + cost penalty," shifting cost control from offline/manual to online/learnable.

Method

Overall Architecture

Given a history \(H\), a task-agnostic lightweight chunking yields a chunk library \(C=\{c_i\}_{i=1}^{N}\). When a query \(q\) arrives, a retriever \(R\) returns candidates \(C_q = R(q, C)\subset C\). Memory extraction is defined as \(m = f_{mem}(q, C_q)\), and the final answer \(\hat y = f_{ans}(q, m)\) is generated by a fixed LLM. \(f_{mem}\) is a fixed-structure modular pipeline: a filtering module \(M_{fil}\) refines \(C_q\) into \(\tilde C_q\), then entity (\(M_{ent}\)), temporal (\(M_{tmp}\)), and topic (\(M_{top}\)) modules extract \(e, t, p\) in parallel, which are finally aggregated into \(m\) by a summarization module \(M_{sum}\). The pipeline structure remains constant; only the tiers within each module are switched by the router.
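The fixed pipeline structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `run_pipeline` function, the `choose_tier` callback, the module keys (`fil`, `ent`, `tmp`, `top`, `sum`), and the toy modules are all hypothetical stand-ins for \(M_{fil}\), \(M_{ent}\), \(M_{tmp}\), \(M_{top}\), and \(M_{sum}\).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Tier = str  # "LOW", "MID", or "HIGH"
# Hypothetical module signature: (query, upstream state, tier) -> intermediate output.
Module = Callable[[str, object, Tier], object]


def run_pipeline(query: str, candidates: List[str], modules: Dict[str, Module],
                 choose_tier: Callable[[str, str, object], Tier]) -> object:
    """Run filter -> parallel entity/temporal/topic -> summary for one query."""
    # Stage 1: the filter refines the retrieved candidates C_q.
    refined = modules["fil"](query, candidates, choose_tier("fil", query, candidates))

    # Stage 2: the three extractors read the same refined chunks in parallel.
    names = ["ent", "tmp", "top"]
    tiers = {n: choose_tier(n, query, refined) for n in names}
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {n: pool.submit(modules[n], query, refined, tiers[n]) for n in names}
        extracted = {n: f.result() for n, f in futures.items()}

    # Stage 3: the summarizer aggregates e, t, p into the memory m.
    return modules["sum"](query, extracted, choose_tier("sum", query, extracted))


# Toy modules (assumptions for illustration only).
def toy_filter(query, chunks, tier):
    keep = {"LOW": 1, "MID": 2, "HIGH": 3}[tier]
    words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))
    return ranked[:keep]


def toy_extract(label):
    def module(query, chunks, tier):
        return f"{label}[{tier}]:{len(chunks)} chunks"
    return module


def toy_summary(query, parts, tier):
    return " | ".join(parts[n] for n in ("ent", "tmp", "top"))


MODULES = {"fil": toy_filter, "ent": toy_extract("ent"), "tmp": toy_extract("tmp"),
           "top": toy_extract("top"), "sum": toy_summary}

memory = run_pipeline(
    "where did Alice travel",
    ["Alice went to Paris", "Bob likes tea", "Alice travel plans"],
    MODULES,
    lambda name, query, state: "MID",  # constant policy standing in for the router
)
```

Swapping the constant `lambda` for a learned policy changes which tier each module runs at, while the call graph stays the same.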

Key Designs

  1. Module-level budget-tier interface:

    • Function: Wraps each module \(M\) in the same I/O contract while allowing three internal versions (LOW/MID/HIGH), enabling the router to perform fine-grained compute allocation without breaking the pipeline structure.
    • Mechanism: All modules share an abstract signature (input query + upstream intermediate state, output current intermediate representation). The three tiers correspond to different implementation complexities or inference costs; routing becomes a discrete 3-way choice.
    • Design Motivation: Applying budget knobs directly to the answer LLM is "after-the-fact" and ignores the costs incurred during memory extraction. Moving the budget to the module level allows the system to identify exactly where "over-consumption" occurs (e.g., filtering vs. entity extraction) and apply targeted adjustments.
  2. Three orthogonal tiering strategies:

    • Function: Facilitates a side-by-side comparison of "implementation/reasoning/capacity" as ways to trade quality for cost, avoiding the conflation of these metrics found in existing work.
    • Mechanism: (i) implementation tiering—LOW uses rules/pattern matching, MID uses BERT-like small experts, HIGH uses LLMs; (ii) reasoning tiering—LOW uses direct answering, MID uses CoT, HIGH uses multi-step/reflection under the same backbone; (iii) capacity tiering—uses the same algorithm with different LM sizes. These trade compute across three orthogonal axes: algorithm, reasoning form, and parameter scale.
    • Design Motivation: In reality, reasoning level and model size are often changed simultaneously, making it difficult to discern the effective knob. Separating these axes informs system designers which strategy is most cost-effective at various budget levels.
  3. PPO-trained query-aware shared router:

    • Function: Models routing as a sequential decision process—observing state \(s_k\) at each module invocation step \(k\), outputting action \(a_k\in\{\text{LOW},\text{MID},\text{HIGH}\}\), and training end-to-end via a "performance + cost" reward.
    • Mechanism: The state \(s_k\) consists of a compact embedding of the query \(q\), upstream module outputs, and a "module descriptor." A full pipeline run for a single query constitutes an episode. The raw cost of an episode is \(c_{raw}=\sum_k c(M_k, a_k)\), where LLM tiers are priced by token usage and non-LLM tiers are treated as zero-cost. The task reward \(r_{task}\in[0,1]\) and a cost reward \(r_{cost}\) are combined as \(r = r_{task} + \lambda\cdot\alpha\cdot r_{cost}\). Cost is normalized via sliding window quantiles \(\tilde c = (\sqrt{c_{raw}}-Q_5)/(Q_{95}-Q_5)\), with \(r_{cost}=1-\mathrm{clip}(\tilde c,0,1)\), so cheaper episodes earn a higher \(r_{cost}\) and the cost term enters with a plus sign. A variance alignment factor \(\alpha = \mathrm{std}(r_{task})/(\mathrm{std}(r_{cost})+\epsilon)\) prevents high-variance terms from dominating gradients.
    • Design Motivation: The path contains non-differentiable LLM calls, necessitating RL. The parameter \(\lambda\) allows users to adjust quality-cost preferences at deployment, while \(\alpha\) addresses training instability caused by magnitude mismatches between task and cost rewards.
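The reward formulas in the mechanism above can be traced with a short sketch. Several details are assumptions made here for illustration: the quantiles \(Q_5\) and \(Q_{95}\) are taken over the square-rooted costs in the window, the standard deviations in \(\alpha\) are computed over one batch of episodes, and the helper names (`percentile`, `cost_rewards`, `total_rewards`) are hypothetical.

```python
import math
import statistics
from typing import List, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Linear-interpolation percentile for p in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def cost_rewards(raw_costs: Sequence[float], window: Sequence[float]) -> List[float]:
    """r_cost = 1 - clip((sqrt(c_raw) - Q5) / (Q95 - Q5), 0, 1)."""
    roots = [math.sqrt(c) for c in window]  # assumption: quantiles over sqrt costs
    q5, q95 = percentile(roots, 5), percentile(roots, 95)
    span = max(q95 - q5, 1e-8)  # guard for a degenerate window (an added assumption)
    rewards = []
    for c in raw_costs:
        c_tilde = (math.sqrt(c) - q5) / span
        rewards.append(1.0 - min(max(c_tilde, 0.0), 1.0))
    return rewards


def total_rewards(r_task: Sequence[float], r_cost: Sequence[float],
                  lam: float, eps: float = 1e-8) -> List[float]:
    """r = r_task + lambda * alpha * r_cost, with alpha = std(r_task) / (std(r_cost) + eps)."""
    alpha = statistics.pstdev(r_task) / (statistics.pstdev(r_cost) + eps)
    return [t + lam * alpha * c for t, c in zip(r_task, r_cost)]


# Toy batch: recent raw costs form the sliding window (made-up numbers).
window = [1.0, 4.0, 9.0, 16.0, 25.0]
r_cost = cost_rewards([1.0, 9.0, 25.0], window)
r_total = total_rewards([0.2, 0.6, 1.0], r_cost, lam=0.5)
```

With these toy numbers the cheapest episode gets \(r_{cost}\) near 1 and the most expensive near 0, and \(\alpha\) rescales the cost term so its spread matches that of the task reward.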

Loss & Training

The routing policy \(\pi_\theta\) is optimized using PPO, with one episode per query and the reward defined by Eq. (7) of the paper (the task-plus-cost reward described above). \(\lambda\) acts as a tunable preference switch during deployment: \(\lambda\) is decreased for performance-first scenarios and increased for tight-budget scenarios, requiring no router retraining (only adjusting the preference curve of the rewards).
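A small worked example may help show how \(\lambda\) shifts the reward's preference between a costly and a cheap tier assignment. All numbers below are made up for illustration, \(\alpha\) is fixed to 1 for simplicity, and the names `reward`, `preferred`, and `break_even` are hypothetical.

```python
def reward(r_task: float, r_cost: float, lam: float, alpha: float = 1.0) -> float:
    """r = r_task + lambda * alpha * r_cost (alpha fixed to 1 here by assumption)."""
    return r_task + lam * alpha * r_cost


# Hypothetical episode outcomes for two tier assignments on the same query.
all_high = {"r_task": 0.80, "r_cost": 0.20}  # better answer, expensive
all_low = {"r_task": 0.60, "r_cost": 0.90}   # weaker answer, cheap


def preferred(lam: float) -> str:
    """Name the assignment with the larger total reward at this lambda."""
    high = reward(lam=lam, **all_high)
    low = reward(lam=lam, **all_low)
    return "all-HIGH" if high > low else "all-LOW"


# The two rewards tie where the task gap equals lambda times the cost-reward gap.
break_even = (all_high["r_task"] - all_low["r_task"]) / (all_low["r_cost"] - all_high["r_cost"])

for lam in (0.1, 1.0):
    print(lam, preferred(lam))
```

In this toy setting the reward favors the expensive assignment below \(\lambda \approx 0.29\) and the cheap one above it, which matches the description of small \(\lambda\) as performance-first and large \(\lambda\) as budget-first.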

Key Experimental Results

Main Results

Evaluations were conducted on three long-term memory/long-context benchmarks: LoCoMo, LongMemEval, and HotpotQA. Metrics include F1, LLM-as-a-Judge, and "Average Cost per Query." The following table reports results for the performance-first setting with the LLaMA-3.3-70B-Instruct backbone, averaged across the three datasets.

| Method | Avg F1 | Avg Judge | Avg Cost ↓ |
|---|---|---|---|
| MemoryBank | 23.75 | 28.47 | 4.14 |
| A-MEM | 30.47 | 40.27 | 32.07 |
| Mem0 | 22.26 | 35.66 | 6.92 |
| MemoryOS | 26.03 | 37.14 | 18.04 |
| LightMem | 35.45 | 49.21 | 5.63 |
| BudgetMem-IMP | 41.84 | 57.36 | 1.29 |
| BudgetMem-REA | 44.19 | 57.39 | 1.52 |
| BudgetMem-CAP | 45.72 | 59.99 | 1.38 |

All three tiering strategies outperform LightMem, the strongest runtime baseline in the table, by roughly 6 to 10 F1 points (41.84-45.72 vs. 35.45) while cutting the average cost to about a quarter of LightMem's (1.29-1.52 vs. 5.63). Capacity tiering offers the highest performance ceiling, while implementation tiering is the most economical at the low-budget end.
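The comparison with LightMem can be recomputed directly from the table rows. The snippet below only restates the reported Avg F1 and Avg Cost figures; the variable names are illustrative.

```python
# Reported averages from the main results table (performance-first setting).
lightmem = {"f1": 35.45, "cost": 5.63}
budgetmem = {
    "IMP": {"f1": 41.84, "cost": 1.29},
    "REA": {"f1": 44.19, "cost": 1.52},
    "CAP": {"f1": 45.72, "cost": 1.38},
}

summary = {}
for name, row in budgetmem.items():
    gain = row["f1"] - lightmem["f1"]       # absolute F1 improvement
    ratio = row["cost"] / lightmem["cost"]  # cost relative to LightMem
    summary[name] = (round(gain, 2), round(ratio, 2))
    print(f"{name}: +{gain:.2f} F1, {ratio:.2f}x LightMem cost")

# IMP: +6.39 F1, 0.23x LightMem cost
# REA: +8.74 F1, 0.27x LightMem cost
# CAP: +10.27 F1, 0.25x LightMem cost
```

The F1 gains span about 6.4 to 10.3 points, and the cost ratios sit between 0.23 and 0.27, so "about a quarter" is the accurate summary of the cost reduction.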

Ablation Study

| Configuration | Key Phenomenon | Explanation |
|---|---|---|
| All-HIGH (No routing) | Slight performance gain, cost explosion | Confirms that "one-size-fits-all big models" is a common but inefficient default. |
| All-LOW (No routing) | Minimal cost, significant drop in F1/Judge | Critical details are lost when applying rules/small models indiscriminately. |
| Random Routing + Fixed Budget | Similar cost but lower scores than BudgetMem | Proves that query-aware learned routing is indispensable. |
| w/o \(\alpha\) (No variance alignment) | Cost reward dominates late training; router collapses to All-LOW | Indicates that reward-scale alignment is crucial for stable training. |

Key Findings

  • Given the same compute budget, capacity tiering typically provides the highest quality ceiling, but implementation tiering is Pareto-superior in "extremely tight budget" ranges—indicating that the optimal knob varies by budget segment.
  • The router is shared and lightweight: all modules use the same policy, distinguishing contexts only via module descriptors, which avoids parameter explosion and data sparsity associated with "per-module routers."
  • Sliding window quantile normalization of \(r_{cost}\) maps costs from different datasets into the same \([0,1]\) range, which is a practical prerequisite for cross-dataset transfer of RL routing.
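The shared-router idea in the findings above, one policy for all modules with a module descriptor as the only per-module signal, can be sketched as a single linear softmax policy. This is an illustrative toy, not the paper's architecture: the one-hot descriptor, the linear layer, the embedding sizes, and the class name `SharedRouter` are assumptions, and no PPO update is shown.

```python
import math
import random
from typing import List, Sequence

TIERS = ["LOW", "MID", "HIGH"]
MODULES = ["fil", "ent", "tmp", "top", "sum"]


def descriptor(module: str) -> List[float]:
    """One-hot module descriptor (an assumed encoding)."""
    return [1.0 if m == module else 0.0 for m in MODULES]


class SharedRouter:
    """One weight matrix serves every module; only the descriptor differs."""

    def __init__(self, query_dim: int, upstream_dim: int, seed: int = 0):
        rng = random.Random(seed)
        in_dim = query_dim + upstream_dim + len(MODULES)
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in TIERS]

    def probs(self, query_emb: Sequence[float], upstream_emb: Sequence[float],
              module: str) -> List[float]:
        # State s_k = [query embedding, upstream output embedding, descriptor].
        state = list(query_emb) + list(upstream_emb) + descriptor(module)
        logits = [sum(w * x for w, x in zip(row, state)) for row in self.w]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self, query_emb: Sequence[float], upstream_emb: Sequence[float],
            module: str, rng: random.Random) -> str:
        """Sample a tier a_k from the policy distribution."""
        weights = self.probs(query_emb, upstream_emb, module)
        return rng.choices(TIERS, weights=weights, k=1)[0]


router = SharedRouter(query_dim=2, upstream_dim=2)
sampler = random.Random(1)
episode = [router.act([0.1, 0.2], [0.0, 0.3], m, sampler) for m in MODULES]
```

The parameter count depends only on the state size and the three tiers, not on the number of modules beyond the descriptor length, which is the property the finding attributes to sharing.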

Highlights & Insights

  • The abstraction of "module-level + 3-tier interface + shared small router" is very clean: it transforms the budget control problem from a mess of engineering parameters into a simple 3-way RL problem, a paradigm applicable to other agent subsystems (tool calling, retrieval depth, reflection steps).
  • By isolating implementation/reasoning/capacity tiering and comparing them on the same testbed, the study provides empirical evidence for the practice of "switching implementations for low budgets and switching capacity for high budgets."
  • Addressing the "task vs. cost reward magnitude conflict" via sliding window quantiles and variance alignment is a detail often overlooked in RL-for-LLM-routing, but it significantly impacts convergence stability and can be reused in any dual-objective RL training.

Limitations & Future Work

  • The current pipeline ("filter → entity/temporal/topic → summary") is manually designed; optimal module partitioning for other domains (e.g., code agents, vision agents) remains to be explored.
  • The three discrete budget tiers are practical but potentially too coarse; future work could extend this to continuous budgets or longer tier lists, necessitating finer exploration strategies.
  • \(\lambda\) is a manually tuned knob at deployment; the authors do not provide a method to automatically derive \(\lambda\) given an SLA (Service Level Agreement).
  • Costs are measured by token price, neglecting real-world deployment costs such as GPU time, caching, and cross-node communication.

Comparison with Related Work

  • vs LightMem / MemoryOS: Both use runtime memory, but LightMem implicitly embeds cost control in the pipeline design; BudgetMem explicitly exposes module-level tiers to sweep the entire Pareto curve.
  • vs Mem0 / MemoryBank / A-MEM: These are offline memory construction methods ("build then query"). BudgetMem uses an on-demand approach, emphasizing query-aware extraction while using the router to keep inference costs comparable to or lower than offline methods.
  • vs LLM Routers (e.g., RouteLLM, LLM-Blender): Classic LLM routing only switches models at the "answer LLM" stage. BudgetMem brings routing into the extraction pipeline, representing a fine-grained extension of LLM routing within an agentic system.

Rating

  • Novelty: TBD
  • Experimental Thoroughness: TBD
  • Writing Quality: TBD
  • Value: TBD