RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs

Conference: ICLR 2026
arXiv: 2509.25426
Code: None
Area: Interpretability
Keywords: Reasoning Language Models, Model Routing, Item Response Theory, Multi-Objective Optimization, Adaptive Inference

TL;DR

This paper proposes RADAR, a framework that formulates adaptive inference for reasoning language models (RLMs) as a multi-objective optimization problem. It leverages Item Response Theory (IRT) to jointly estimate interpretable query difficulty and model configuration ability parameters, enabling lightweight and scalable query-level routing. RADAR outperforms state-of-the-art routing methods on 8 reasoning benchmarks while adding only approximately 7ms of latency.

Background & Motivation

Recent RLMs such as DeepSeek-R1, o4-mini, and Qwen3 have demonstrated remarkable capabilities on challenging tasks in mathematics, science, and programming. Selecting an appropriate RLM involves a performance-cost trade-off along two key dimensions: (1) model size, where larger models yield better performance but incur higher cost; and (2) reasoning budget, where more thinking tokens improve performance but increase latency and expense.

A key observation is that over 50% of queries on MATH-500 can be correctly answered by Qwen3-0.6B with a minimal reasoning budget, whereas some difficult queries require stronger RLM configurations. Moreover, stronger RLMs may "overthink" simple problems, paradoxically degrading performance. This motivates a central question: how can one select the "just strong enough" RLM configuration for each query to maximize cost-efficiency without sacrificing performance?

Method

Overall Architecture

The RADAR framework consists of four core components:

1. Discretization: Each RLM is discretized into multiple configurations $g = (m, u) \in \mathcal{G}$ according to available reasoning budgets.
2. Multi-Objective Optimization (MOO): Routing is formulated as a bi-objective optimization that maximizes performance while minimizing cost.
3. IRT Calibration: A 2PL IRT model jointly estimates query difficulty and configuration ability.
4. Adaptive Testing: The ability parameter of a new model configuration is rapidly estimated using a small set of dynamically selected queries.

Key Designs

Discretization and Configuration Routing: Each RLM $m \in \mathcal{M}$ is discretized by reasoning budget $u \in \mathcal{U}_m$ into configurations $g = (m, u)$. For open-source RLMs, budgets are enforced by counting thinking tokens and appending a truncation message when the budget is exceeded. Design Motivation: Unifying model selection and budget selection into a single configuration routing problem enables optimization over the configuration space.
Multi-Objective Optimization Routing: For each query $q$, the framework solves $g^* = \arg\max_{g \in \mathcal{G}} f(p_q(g), c_q(g))$, where $p_q(g)$ is the predicted performance and $c_q(g)$ is the predicted cost. Two scalarization techniques are explored:
Linear Scalarization: $\text{LSP}_q^{w_1} = \arg\max_{g \in \mathcal{G}} w_1 p_q(g) - (1-w_1) c_q(g)$
Chebyshev Scalarization: $\text{CSP}_q^{w_1} = \arg\min_{g \in \mathcal{G}} \max\{w_1|1-p_q(g)|, (1-w_1)c_q(g)\}$

Design Motivation: Chebyshev scalarization can recover non-convex regions of the Pareto frontier that linear scalarization cannot. This is the first work to introduce MOO techniques beyond linear scalarization for LLM routing.
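The difference between the two scalarizations can be illustrated with a minimal sketch. The configuration names and the (performance, cost) values below are hypothetical and chosen only so that one Pareto-optimal configuration sits in a non-convex region of the frontier; cost is assumed to be normalized to [0, 1].

```python
from typing import Dict, Tuple

# Hypothetical predictions for one query: name -> (p_q(g), c_q(g)).
# Names and numbers are illustrative assumptions, not values from the paper.
Candidates = Dict[str, Tuple[float, float]]


def route_linear(cands: Candidates, w1: float) -> str:
    """Linear scalarization: argmax_g  w1 * p - (1 - w1) * c."""
    return max(cands, key=lambda g: w1 * cands[g][0] - (1.0 - w1) * cands[g][1])


def route_chebyshev(cands: Candidates, w1: float) -> str:
    """Chebyshev scalarization: argmin_g  max{ w1 * |1 - p|, (1 - w1) * c }."""
    return min(
        cands,
        key=lambda g: max(w1 * abs(1.0 - cands[g][0]), (1.0 - w1) * cands[g][1]),
    )


candidates: Candidates = {
    "small-low": (0.50, 0.00),
    "mid-mid": (0.70, 0.50),    # Pareto-optimal, but below the convex hull
    "large-high": (1.00, 1.00),
}

# Sweep the weight: linear scalarization never picks "mid-mid" here.
linear_choices = {route_linear(candidates, w / 100) for w in range(101)}
chebyshev_choice = route_chebyshev(candidates, 0.6)
print(sorted(linear_choices), chebyshev_choice)
```

In this toy setup, "mid-mid" is not dominated by either neighbor, yet no weight makes it the linear optimum, while the Chebyshev rule selects it at $w_1 = 0.6$.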

2PL IRT Model: A two-parameter logistic (2PL) model parameterizes the performance prediction function. To enable out-of-distribution (OOD) generalization, query difficulty $b_j = \mathbf{w}_b^\top \mathbf{e}_j$ and discrimination $a_j = \mathbf{w}_a^\top \mathbf{e}_j$ are linear functions of the query embedding $\mathbf{e}_j$, and each configuration $g_i$ is assigned a scalar ability parameter $\theta_i$. The probability of a correct response is $p_{ij} = \sigma(a_j(\theta_i - b_j))$. Design Motivation: Scalar ability values capture an interpretable ordering across model configurations with fewer parameters than multidimensional IRT (MIRT), while generalizing to unseen queries via embeddings.
Adaptive Testing Extension: When estimating the ability parameter for a new model configuration, queries maximizing Fisher information are iteratively selected: $j_t = \arg\max_{j \in \mathcal{Q} \setminus \mathcal{S}_{t-1}} I(\hat{\theta}_{t-1}, a_j, b_j)$, where $I(\theta, a_j, b_j) = a_j^2 \sigma(a_j(\theta-b_j))[1-\sigma(a_j(\theta-b_j))]$. Design Motivation: Only approximately 12% of the training set needs to be evaluated to accurately estimate a new configuration's ability, enabling plug-and-play integration.
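The 2PL prediction and the Fisher-information selection rule above can be sketched in a few lines of plain Python. The function names, the toy embedding, and the weight vectors are assumptions made for illustration. The step that re-estimates $\hat{\theta}$ after each response is omitted because this summary does not specify it.

```python
import math
from typing import List, Sequence, Set, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def item_params(e: Sequence[float], w_a: Sequence[float],
                w_b: Sequence[float]) -> Tuple[float, float]:
    """Discrimination a_j = w_a^T e_j and difficulty b_j = w_b^T e_j."""
    a = sum(wi * ei for wi, ei in zip(w_a, e))
    b = sum(wi * ei for wi, ei in zip(w_b, e))
    return a, b


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response: sigma(a * (theta - b))."""
    return sigmoid(a * (theta - b))


def fisher_information(theta: float, a: float, b: float) -> float:
    """I(theta, a, b) = a^2 * p * (1 - p) with p = sigma(a * (theta - b))."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def next_query(theta_hat: float, items: List[Tuple[float, float]],
               asked: Set[int]) -> int:
    """Pick the not-yet-asked item with maximal Fisher information."""
    remaining = [j for j in range(len(items)) if j not in asked]
    return max(remaining, key=lambda j: fisher_information(theta_hat, *items[j]))


# Toy item bank of (a_j, b_j) pairs; values are illustrative only.
items = [(1.0, -2.0), (1.0, 0.1), (1.0, 3.0)]
first = next_query(0.0, items, set())      # difficulty closest to theta_hat
second = next_query(0.0, items, {first})   # next most informative item
```

With equal discrimination, the most informative item is the one whose difficulty lies closest to the current ability estimate, which is why the selection concentrates evaluation on queries near the new configuration's level.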

Loss & Training

The IRT model is trained with binary cross-entropy loss: $$\mathcal{L}_{2PL} = -\frac{1}{nk} \sum_{i=1}^n \sum_{j=1}^k [y_{ij} \log p_{ij} + (1-y_{ij}) \log(1-p_{ij})]$$

where $y_{ij} \in \{0,1\}$ indicates whether configuration $g_i$ correctly answers query $q_j$. In total, 1.75 million binary response records were collected, covering 35 configurations and 50,139 queries.
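As a rough illustration of the objective, the following sketch evaluates $\mathcal{L}_{2PL}$ on a dense response matrix in plain Python. It assumes the item parameters $(a_j, b_j)$ have already been computed from embeddings. The probability clamp is an added numerical safeguard and not part of the formula, and the optimizer (for example, gradient descent over $\theta$, $\mathbf{w}_a$, $\mathbf{w}_b$) is left out.

```python
import math
from typing import List, Sequence, Tuple


def bce_2pl_loss(thetas: Sequence[float],
                 items: Sequence[Tuple[float, float]],
                 responses: Sequence[Sequence[int]]) -> float:
    """Mean binary cross-entropy over n configurations and k queries.

    thetas[i]       -- scalar ability of configuration g_i
    items[j]        -- (a_j, b_j) for query q_j
    responses[i][j] -- y_ij in {0, 1}
    """
    n, k = len(thetas), len(items)
    total = 0.0
    for i in range(n):
        for j in range(k):
            a, b = items[j]
            p = 1.0 / (1.0 + math.exp(-a * (thetas[i] - b)))
            # Clamp to avoid log(0); an assumption added for numerical safety.
            p = min(max(p, 1e-12), 1.0 - 1e-12)
            y = responses[i][j]
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / (n * k)


# Two hypothetical configurations, two hypothetical queries.
loss = bce_2pl_loss(
    thetas=[-1.0, 2.0],
    items=[(1.0, 0.0), (1.5, 1.0)],
    responses=[[0, 0], [1, 1]],
)
```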

Key Experimental Results

Main Results (In-Distribution Setting, Hypervolume Metric, Higher is Better)

Benchmark	Random-Pair	RouterBench	IRT-Router	RADAR (Ours)	Relative gain vs. runner-up
GPQA-Diamond	0.5545	0.6866	0.6942	0.7513	+8.2%
MMLU	0.6905	0.8592	0.8604	0.8720	+1.3%
MMLU-Redux	0.7281	0.9053	0.9117	0.9230	+1.2%
LSAT	0.6913	0.9125	0.9163	0.9188	+0.3%
FRAMES	0.6589	0.8325	0.8501	0.8762	+3.1%
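For readers unfamiliar with the metric, a two-objective hypervolume can be computed as the area dominated by a router's performance-cost points. The sketch below is a generic version under stated assumptions: cost is normalized to [0, 1], and the reference point is (cost = 1, performance = 0). The paper's exact normalization and reference point are not given in this summary, so the numbers it produces are not comparable to the table.

```python
from typing import Iterable, List, Tuple


def hypervolume_2d(points: Iterable[Tuple[float, float]],
                   ref_cost: float = 1.0,
                   ref_perf: float = 0.0) -> float:
    """Area dominated by (cost, performance) points.

    Cost is minimized, performance is maximized, and the area is bounded
    by the reference point (ref_cost, ref_perf).
    """
    inside = [(c, p) for c, p in points if c <= ref_cost and p >= ref_perf]
    # Sweep by increasing cost, keeping points that strictly improve performance.
    front: List[Tuple[float, float]] = []
    best = ref_perf
    for c, p in sorted(inside, key=lambda t: (t[0], -t[1])):
        if p > best:
            front.append((c, p))
            best = p
    # Sum the rectangles between consecutive front points.
    area = 0.0
    for i, (c, p) in enumerate(front):
        next_c = front[i + 1][0] if i + 1 < len(front) else ref_cost
        area += (next_c - c) * (p - ref_perf)
    return area


# Hypothetical operating points of a router as its trade-off weight is swept.
hv = hypervolume_2d([(0.0, 0.5), (0.5, 0.7), (0.6, 0.6), (1.0, 1.0)])
```

A router whose operating points reach high performance at low cost covers more of this area, which is why higher values are better.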

Ablation Study

Configuration	Hypervolume (qualitative)	Note
Linear Scalarization (ID)	Marginally better	Slight advantage over Chebyshev in the ID setting
Chebyshev Scalarization (OOD)	Better	Clear advantage over linear in the OOD setting
20% Training Data	Comparable	Similar performance to full-data training with only 20% of the data
RADAR (35 configs)	Baseline	Original 35 configurations
RADAR++ (43 configs)	Improved	Improvement after adding Qwen3-14B via adaptive testing

Key Findings

On MATH-500, RADAR achieves 90% of o4-mini (high budget) performance at only 1.31% of its cost.
On FRAMES (long-context multi-document QA), RADAR achieves 90% performance at 10% of the cost, whereas the runner-up requires 30%.
RADAR's routing latency is approximately 7ms, negligible compared to the ~870ms generation time of the smallest RLM configuration.
Adaptive testing requires only 12% of the training set (~5k queries) to accurately estimate the ability of a new configuration.
Estimated query difficulty shows moderate Pearson correlation (0.509) with five-level human-annotated difficulty labels on MATH-500.

Highlights & Insights

First to introduce MOO beyond linear scalarization for LLM routing: Chebyshev scalarization recovers non-convex regions of the Pareto frontier.
Psychometrics-inspired IRT model: Treating queries as exam items and model configurations as examinees yields a natural and interpretable framework.
Extreme cost savings: Achieving 90% performance at 1.31% cost on MATH-500 is a compelling result.
Plug-and-play design: No fine-tuning of RLMs is required; new models can be integrated quickly in a black-box manner.
Strong OOD generalization: Generalization to long-context multi-document QA tasks is particularly notable.

Limitations & Future Work

Cost prediction relies on a simple heuristic (average token count × unit price) and does not account for query-specific cost variation.
Generalization to highly difficult OOD benchmarks such as AIME is limited, with the router tending to assign lower-ability configurations.
Only text modality is considered; extension to multimodal reasoning scenarios remains unexplored.
The linear parameterization of 2PL IRT may be insufficient to capture complex difficulty–ability interactions.
The setting of aggregate budget constraints across batched queries is not addressed.

Related Work

IRT-Router (Song et al., 2025): Uses multidimensional IRT (MIRT), which has more parameters and vector-valued abilities; RADAR obtains an interpretable ordering with scalar ability values.
RouterBench (Hu et al., 2024): Conventional model routing; this work extends it to RLM configuration-level routing.
Efficient inference methods (L1/S1, etc.): Complementary to RADAR and can be incorporated as additional configurations in the routing pool.
Adaptive testing in educational assessment: The Fisher information-based item selection strategy is successfully adapted from this domain.

Rating

Novelty: ⭐⭐⭐⭐ The combination of IRT and MOO is novel, though individual components are not entirely new.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 8 benchmarks, 35 configurations, and 1.75 million response records make the evaluation comprehensive and rigorous.
Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, complete mathematical derivations, and intuitive figures.
Value: ⭐⭐⭐⭐⭐ Directly addresses a core challenge in practical RLM deployment with substantial cost savings.