TimeBill: Time-Budgeted Inference for Large Language Models¶
- Conference: AAAI 2026
- arXiv: 2512.21859
- Code: None
- Area: Autonomous Driving / LLM Inference Optimization
- Keywords: Time-budgeted inference, KV cache eviction, response length prediction, execution time estimation, real-time systems
TL;DR¶
This paper proposes TimeBill, a framework that adaptively adjusts the KV cache eviction ratio to fit a given time budget, using a fine-grained Response Length Predictor (RLP) and a workload-guided Execution Time Estimator (ETE) to maximize LLM response quality while keeping the inference completion rate high.
Background & Motivation¶
State of the Field¶
Large language models (LLMs) are increasingly deployed in time-critical systems such as robotics, autonomous driving, embodied intelligence, and industrial automation. In these scenarios, LLMs must generate accurate responses within hard real-time deadlines; failure to do so is treated as a system fault. Representative examples include:
- Autoware.Flex, which leverages LLMs to translate natural language instructions into formats interpretable by autonomous driving systems
- DriveGPT4, which uses LLMs to perceive the driving environment and produce driving decisions
Core Challenges¶
Execution time uncertainty: Unlike CNNs, LLMs generate tokens autoregressively, so end-to-end execution time is highly uncertain and depends on the response length.
Coarse-grained response length prediction: Existing predictors (e.g., 5-class classification in ProxyModel, 10-class in S3) operate at insufficient granularity, and BERT-based architectures struggle to handle long inputs.
Inflexibility of fixed KV cache eviction ratios: Different tasks carry different time budgets; a fixed eviction ratio either causes timeout (ratio too low) or severely degrades response quality (ratio too high).
Limitations of Prior Work¶
- Offline methods (quantization, pruning): Compress the model before deployment and cannot adapt to time budgets at runtime.
- Online methods (KV cache eviction/quantization): Methods such as StreamingLLM and SnapKV employ fixed eviction ratios and ignore time budget constraints.
- Existing predictors: BERT-based predictors are constrained by context length and cannot handle long inputs; coarse-grained classification fails to provide sufficiently precise response time estimates.
Method¶
Overall Architecture¶
The TimeBill framework comprises three core components:
- Fine-grained Response Length Predictor (RLP): Based on a small language model (SLM), it predicts the response length of the target LLM.
- Workload-guided Execution Time Estimator (ETE): Combines FLOPs analysis with performance profiling to estimate end-to-end execution time.
- Time-budget-efficient inference mechanism: Adaptively adjusts the KV cache eviction ratio \(\alpha\) based on predicted execution time and the time budget.
Key Designs¶
1. Problem Formulation¶
Time-budgeted LLM inference is formulated as a constrained optimization problem:
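Written in this summary's notation (the paper's exact formulation may differ slightly), with the eviction ratio \(\alpha\) as the decision variable and \(y(\alpha)\) the generated response:

\[
\max_{\alpha \in [0,1]} \; \mathcal{M}\big(y(\alpha)\big)
\quad \text{s.t.} \quad
T_{\text{exec}}(\alpha) \le T, \qquad |y(\alpha)| \le N_{\max}
\]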
where \(\mathcal{M}(\cdot)\) denotes the response quality metric, \(T\) is the time budget, and \(N_{\max}\) is the maximum generation length. The objective is to maximize response quality subject to the time constraint.
2. Fine-Grained Response Length Predictor (RLP)¶
Core Idea: Response length prediction is formulated as a fine-grained classification task. An SLM (Qwen2.5-0.5B-Instruct) replaces BERT to support long-input processing.
- Architecture: Embedding layer + \(L\) decoder layers (RMSNorm–CausalAttention–RMSNorm–FFN/SwiGLU) + classification head
- Bucket design: Response lengths are partitioned into buckets of fixed size \(B\), with 512 buckets by default (\(B=16\), covering response lengths up to roughly \(512 \times 16 = 8192\) tokens)
- Knowledge distillation alignment: Actual response lengths \(N_j\) from the target LLM are collected to construct a training dataset \((x_j, \lceil N_j/B \rceil)\), aligning the RLP with the target LLM
Post-processing caps the predicted length so that it never exceeds the maximum generation length \(N_{\max}\).
Design Motivation: Compared to BERT, the SLM provides a longer context window and handles long inputs more effectively. Fine-grained classification (512 classes) yields more precise predictions than coarse-grained schemes (5–10 classes), and knowledge distillation ensures predictor–target alignment.
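A minimal sketch of what such an SLM-based, bucketed length predictor could look like, assuming a HuggingFace-style sequence-classification head on the Qwen2.5-0.5B-Instruct backbone (the paper's implementation is not released; names and post-processing details are illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

B = 16             # bucket size
NUM_BUCKETS = 512  # number of length buckets (512 * 16 = 8192 tokens max)
N_MAX = 2048       # maximum generation length of the target LLM (illustrative)

# SLM backbone with a classification head over length buckets; the head would be
# trained with cross-entropy on distillation labels ceil(N_j / B) from the target LLM.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", num_labels=NUM_BUCKETS
)
model.config.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

@torch.no_grad()
def predict_response_length(prompt: str) -> int:
    """Predict the target LLM's response length (in tokens) from the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    logits = model(**inputs).logits        # shape: (1, NUM_BUCKETS)
    bucket = int(logits.argmax(dim=-1))    # predicted bucket index ~ ceil(N / B)
    n_hat = bucket * B                     # take the bucket's upper edge as the prediction
    return min(max(n_hat, B), N_MAX)       # post-processing cap
```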
3. Workload-Guided Execution Time Estimator (ETE)¶
Core Idea: Execution time is accurately estimated by combining theoretical FLOPs modeling with performance profiling curve fitting.
FLOPs-based analysis:
- Prefill stage: execution time is quadratic in the input length \(N_x\) (due to the \(QK^T\) computation in CausalAttention)
- Decoding step: execution time is linear in the KV cache length \(N_{kv}\)
Performance profiling: Execution times are measured under varying \(N_x\) and \(N_{kv}\) configurations, and coefficients \(a, b, c, p, q\) are fitted via least squares.
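A concrete parameterization consistent with this analysis and the five fitted coefficients (notation assumed here; the paper's exact form may differ):

\[
T_{\text{prefill}}(N_x) = a N_x^2 + b N_x + c, \qquad
t_{\text{dec}}(N_{kv}) = p\, N_{kv} + q .
\]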
Effect of KV cache eviction on execution time: Evicting a fraction \(\alpha\) of the KV cache shortens the KV length \(N_{kv}(i)\) seen at each decoding step \(i\), and therefore the per-step decoding time.
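Assuming eviction at ratio \(\alpha\) is applied to the prompt KV cache after prefill, while decoded tokens accumulate normally (a common setting for SnapKV-style methods; the paper's exact expression may differ):

\[
N_{kv}(i) \;\approx\; (1-\alpha)\, N_x + i .
\]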
A pessimism factor \(k\) (\(k \geq 1\)) is introduced to estimate the worst-case execution time (WCET), ensuring hard real-time constraints are satisfied.
4. Time-Budget-Efficient Inference Mechanism¶
Core Idea: The original optimization problem is reformulated as minimizing the KV cache eviction ratio \(\alpha\), since a higher eviction ratio degrades response quality.
Substituting the estimated execution time into the time-budget constraint yields a closed-form expression for the smallest feasible eviction ratio \(\alpha\), which TimeBill then applies (clipped to at most \(\alpha_{\max}\)).
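As a sketch of how such a closed form falls out of the models above, using the predicted length \(\hat N\) from the RLP and the pessimism factor \(k\) (the paper's exact expression may differ):

\[
T_{\text{prefill}}(N_x) + k \sum_{i=1}^{\hat N} \Big( p\big((1-\alpha) N_x + i\big) + q \Big) \;\le\; T
\]
\[
\Longrightarrow \quad
\alpha \;\ge\; 1 - \frac{T - T_{\text{prefill}}(N_x) - k\big(q\hat N + p\,\hat N(\hat N + 1)/2\big)}{k\, p\, \hat N\, N_x} .
\]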
System deployment: RLP prediction can be executed in parallel with the LLM prefill stage (on a CPU or a separate GPU); if the predictor's execution time is less than the prefill time, the prediction overhead is effectively zero.
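Putting the pieces together, a minimal sketch of the inference-time logic under the models assumed above; all coefficient values and function names are illustrative, since TimeBill's code is not released:

```python
# Fitted ETE coefficients (illustrative values, obtained offline via least squares)
a, b, c = 5e-8, 2e-5, 0.02   # prefill time:          a*Nx^2 + b*Nx + c  (seconds)
p, q = 2e-6, 0.005           # per-decode-step time:  p*Nkv + q          (seconds)

def prefill_time(nx: int) -> float:
    """Estimated prefill time for an input of nx tokens."""
    return a * nx * nx + b * nx + c

def optimal_eviction_ratio(nx: int, n_hat: int, budget: float,
                           k: float = 5.0, alpha_max: float = 0.95) -> float:
    """Smallest eviction ratio keeping the pessimistic time estimate within budget."""
    slack = budget - prefill_time(nx) - k * (q * n_hat + p * n_hat * (n_hat + 1) / 2)
    if slack <= 0:
        return alpha_max                      # budget too tight: evict as much as allowed
    alpha = 1.0 - slack / (k * p * n_hat * nx)
    return min(max(alpha, 0.0), alpha_max)

# Example: 4096-token prompt, predicted 300-token response, 10 s budget
print(optimal_eviction_ratio(nx=4096, n_hat=300, budget=10.0))   # ~0.91
```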
Loss & Training¶
- The RLP is trained with cross-entropy loss for the classification task.
- The Arena-Human-Preference-100k dataset is used to construct training data, avoiding contamination of test sets.
- The ETE is fitted from performance profiling data via least squares, requiring no neural network training (see the fitting sketch after this list).
- KV cache eviction is implemented using SnapKV, with \(\alpha_{\max}\) set to 95%.
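A minimal sketch of that least-squares fit under the quadratic/linear time models assumed earlier; the profiling numbers below are purely illustrative:

```python
import numpy as np

# Profiled measurements (illustrative): prefill time vs. input length, and
# per-decoding-step time vs. KV cache length, averaged over repeated runs.
prefill_lengths = np.array([512, 1024, 2048, 4096, 8192], dtype=float)
prefill_times   = np.array([0.08, 0.21, 0.62, 2.10, 7.90])        # seconds
kv_lengths      = np.array([512, 1024, 2048, 4096, 8192], dtype=float)
decode_times    = np.array([0.011, 0.012, 0.014, 0.019, 0.028])   # seconds per step

# Prefill stage: quadratic in input length  ->  a*Nx^2 + b*Nx + c
a, b, c = np.polyfit(prefill_lengths, prefill_times, deg=2)
# Decoding step: linear in KV cache length  ->  p*Nkv + q
p, q = np.polyfit(kv_lengths, decode_times, deg=1)

print(f"prefill: {a:.3e}*Nx^2 + {b:.3e}*Nx + {c:.3e}")
print(f"decode : {p:.3e}*Nkv + {q:.3e}")
```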
Key Experimental Results¶
Main Results¶
Experiments are conducted on Qwen2.5-7B-Instruct with the LongBench dataset on an NVIDIA A40 GPU.
| Method | Time Budget | Avg. Score (Kill strategy) | Completion Rate (Kill strategy) | Notes |
|---|---|---|---|---|
| Vanilla | 5–10s | Lowest | Lowest | Frequently times out |
| \(\alpha\)=25% | 5–10s | Low–medium | Medium | Insufficient eviction |
| \(\alpha\)=50% | 5–10s | Medium | Medium–high | Rises then falls |
| \(\alpha\)=95% | 5–10s | Low–medium | Among highest | Excessive eviction, poor quality |
| AWQ | 5–10s | Marginally above Vanilla | Marginally above Vanilla | Orthogonally composable with TimeBill |
| TimeBill | 5–10s | Highest | Comparable to \(\alpha\)=95% | Adaptive balance |
Ablation Study¶
Response length predictor comparison:
| Method | # Buckets | MAE↓ | RMSE↓ | R²↑ |
|---|---|---|---|---|
| Ours (regression) | — | 64.21 | 103.30 | 0.516 |
| Ours (128 buckets) | 128 | 48.95 | 87.57 | 0.652 |
| Ours (256 buckets) | 256 | 44.15 | 78.63 | 0.719 |
| Ours (512 buckets) | 512 | 42.71 | 78.13 | 0.723 |
| ProxyModel | 5 | 105.72 | 136.79 | 0.152 |
| S3 | 10 | 108.96 | 148.91 | −0.004 |
Execution time estimation accuracy:
- Prefill stage MAPE: 1.22%
- Decoding step MAPE: 1.69%
Effect of pessimism factor \(k\) (T=5s, Kill strategy):
- \(k=1\)–\(5\): Increasing \(k\) improves both completion rate and average score.
- \(k=6\)–\(8\): Excessively large \(k\) leads to overly aggressive eviction (\(\alpha\) too high), severely degrading response quality and reducing the average score.
Key Findings¶
- Fine-grained classification (512 buckets) reduces prediction error (MAE) by more than 2.5× compared to coarse-grained schemes (5/10 buckets).
- The SLM-based predictor reduces MAE by 60% compared to BERT-based predictors.
- TimeBill achieves the highest average response score across all tested time budgets (5–10s).
- A pessimism factor of \(k=5\) is identified as optimal, consistent with common practice in hard real-time systems.
Highlights & Insights¶
- Novel problem formulation: This work is the first to formally cast LLM inference as a time-budget-constrained optimization problem, providing a theoretical framework for the field.
- Elegant closed-form solution: By combining FLOPs modeling with performance profiling, the optimal KV cache eviction ratio is derived analytically, eliminating the need for online search.
- Efficient system design: Running the RLP in parallel with the prefill stage effectively hides the prediction overhead, provided the RLP finishes before prefill does.
- Strong practicality: The framework accommodates heterogeneous time budgets across different inference tasks and is orthogonally composable with offline methods such as quantization.
Limitations & Future Work¶
- Validation is limited to single-GPU, single-request scenarios; batched inference and multi-request scheduling are not considered.
- The RLP must be retrained for each target LLM, limiting transferability.
- The pessimism factor \(k\) requires manual selection; an adaptive adjustment mechanism is absent.
- The KV cache eviction strategy is fixed to SnapKV; integration with alternative eviction strategies is unexplored.
- End-to-end validation on real autonomous driving systems has not been conducted.
Related Work & Insights¶
- TimeBill is complementary to KV cache eviction methods such as SnapKV, providing a mechanism for dynamically adjusting the eviction ratio.
- The work offers both a theoretical and practical framework for deploying LLMs in real-time systems.
- The proposed approach may inspire analogous time-budget allocation strategies in multi-model collaborative inference scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Novel problem formulation, though the methodology primarily combines existing components.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple baselines, eviction strategies, and time budgets are evaluated comprehensively.
- Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are clear and system design diagrams are thorough.
- Value: ⭐⭐⭐⭐ — Strong reference value for real-time LLM deployment.