RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs

Conference: ICLR 2026 | arXiv: 2509.25426 | Code: None | Area: Interpretability | Keywords: Reasoning Language Models, Model Routing, Item Response Theory, Multi-Objective Optimization, Adaptive Inference

TL;DR

This paper proposes RADAR, a framework that formulates adaptive inference for reasoning language models (RLMs) as a multi-objective optimization problem. It leverages Item Response Theory (IRT) to jointly estimate interpretable query difficulty and model configuration ability parameters, enabling lightweight and scalable query-level routing. RADAR outperforms state-of-the-art routing methods on 8 reasoning benchmarks while adding only approximately 7ms of latency.

Background & Motivation

Recent RLMs such as DeepSeek-R1, o4-mini, and Qwen3 have demonstrated remarkable capabilities on challenging tasks in mathematics, science, and programming. Selecting an appropriate RLM involves a performance–cost trade-off along two key dimensions: (1) model size—larger models yield better performance but incur higher cost; (2) reasoning budget—more thinking tokens improve performance but increase latency and expense.

A key observation is that over 50% of queries on MATH-500 can be correctly answered by Qwen3-0.6B with a minimal reasoning budget, whereas some difficult queries require stronger RLM configurations. Moreover, stronger RLMs may "overthink" simple problems, paradoxically degrading performance. This motivates a central question: how can one select the "just strong enough" RLM configuration for each query to maximize cost-efficiency without sacrificing performance?

Method

Overall Architecture

The RADAR framework consists of four core components:

  1. Discretization: Each RLM is discretized into multiple configurations \(g = (m, u) \in \mathcal{G}\) according to available reasoning budgets.
  2. Multi-Objective Optimization (MOO): Routing is formulated as a bi-objective optimization that maximizes performance while minimizing cost.
  3. IRT Calibration: A 2PL IRT model jointly estimates query difficulty and configuration ability.
  4. Adaptive Testing: The ability parameter of a new model configuration is rapidly estimated from a small set of dynamically selected queries.

Key Designs

  1. Discretization and Configuration Routing: Each RLM \(m \in \mathcal{M}\) is discretized by reasoning budget \(u \in \mathcal{U}_m\) into configurations \(g = (m, u)\). For open-source RLMs, budgets are enforced by counting thinking tokens and appending a truncation message when the budget is exceeded. Design Motivation: Unifying model selection and budget selection into a single configuration routing problem enables optimization over the configuration space.
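The budget-enforcement idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `generate_step` token stream, the `</think>` delimiter, and the wording of the truncation message are all assumptions.

```python
def generate_with_budget(generate_step, prompt, budget):
    """Stream thinking tokens until `budget` is reached, then force truncation."""
    thinking_tokens = []
    for token in generate_step(prompt):
        if token == "</think>":  # model finished reasoning on its own
            break
        thinking_tokens.append(token)
        if len(thinking_tokens) >= budget:
            # Budget exceeded: append a truncation message so the model
            # must commit to a final answer immediately.
            thinking_tokens.append(
                "\n...thinking budget exhausted, answer now.\n</think>")
            break
    return thinking_tokens

# Toy usage with a fake token stream standing in for an open-source RLM:
fake_stream = lambda prompt: iter(["a", "b", "c", "d", "e"])
out = generate_with_budget(fake_stream, "2+2?", budget=3)
```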

  2. Multi-Objective Optimization Routing: For each query \(q\), the framework solves \(g^* = \arg\max_{g \in \mathcal{G}} f(p_q(g), c_q(g))\), where \(p_q(g)\) is the predicted performance and \(c_q(g)\) is the predicted cost. Two scalarization techniques are explored:

     • Linear scalarization: \(\text{LSP}_q^{w_1} = \arg\max_{g \in \mathcal{G}} w_1 p_q(g) - (1-w_1) c_q(g)\)

     • Chebyshev scalarization: \(\text{CSP}_q^{w_1} = \arg\min_{g \in \mathcal{G}} \max\{w_1|1-p_q(g)|, (1-w_1)c_q(g)\}\)

Design Motivation: Chebyshev scalarization can recover non-convex regions of the Pareto frontier that linear scalarization cannot. This is the first work to introduce MOO techniques beyond linear scalarization for LLM routing.
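Both scalarizations reduce to a one-line selection over the configuration pool. A minimal sketch, with made-up predicted performance `p` and normalized cost `c` values for three hypothetical configurations:

```python
def linear_pick(configs, w1):
    """Linear scalarization: maximize w1*p - (1-w1)*c."""
    return max(configs, key=lambda g: w1 * g["p"] - (1 - w1) * g["c"])

def chebyshev_pick(configs, w1):
    """Chebyshev scalarization: minimize max(w1*|1-p|, (1-w1)*c)."""
    return min(configs, key=lambda g: max(w1 * abs(1 - g["p"]),
                                          (1 - w1) * g["c"]))

# Illustrative configurations (names, p, c are all toy values):
configs = [
    {"name": "0.6B/short", "p": 0.55, "c": 0.01},
    {"name": "8B/medium",  "p": 0.80, "c": 0.20},
    {"name": "32B/long",   "p": 0.92, "c": 1.00},
]

balanced = chebyshev_pick(configs, w1=0.5)   # balanced weight
quality  = chebyshev_pick(configs, w1=0.9)   # performance-heavy weight
```

Sweeping \(w_1\) over \([0, 1]\) traces out the routing policies along the performance-cost trade-off curve.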

  3. 2PL IRT Model: A two-parameter logistic model is used to parameterize the performance prediction function. To enable OOD generalization, query difficulty \(b_j = \mathbf{w}_b^\top \mathbf{e}_j\) and discrimination \(a_j = \mathbf{w}_a^\top \mathbf{e}_j\) are parameterized as linear transformations of query embeddings \(\mathbf{e}_j\), and each configuration \(g_i\) is assigned a scalar ability parameter \(\theta_i\). The probability of a correct response is \(p_{ij} = \sigma(a_j(\theta_i - b_j))\). Design Motivation: Scalar ability values capture an interpretable ordering across model configurations with fewer parameters than multidimensional IRT (MIRT), while generalizing to unseen queries via embeddings.
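The 2PL response model itself is a single sigmoid. A minimal sketch, with illustrative (not fitted) values for ability, discrimination, and difficulty:

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability that a configuration with ability theta answers
    a query with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

easy, hard = (1.2, -0.5), (1.2, 1.5)   # (discrimination a, difficulty b), made up
weak, strong = -0.2, 1.0               # scalar abilities theta, made up

# Stronger configurations succeed more often; harder queries succeed less often.
p_strong_easy = p_correct(strong, *easy)
p_weak_easy   = p_correct(weak, *easy)
p_weak_hard   = p_correct(weak, *hard)
```

The scalar \(\theta\) gives the interpretable total ordering of configurations mentioned above: for any query with positive discrimination, a higher-\(\theta\) configuration always has a higher predicted success probability.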

  4. Adaptive Testing Extension: When estimating the ability parameter for a new model configuration, queries maximizing Fisher information are iteratively selected: \(j_t = \arg\max_{j \in \mathcal{Q} \setminus \mathcal{S}_{t-1}} I(\hat{\theta}_{t-1}, a_j, b_j)\), where \(I(\theta, a_j, b_j) = a_j^2 \sigma(a_j(\theta-b_j))[1-\sigma(a_j(\theta-b_j))]\). Design Motivation: Only approximately 12% of the training set needs to be evaluated to accurately estimate a new configuration's ability, enabling plug-and-play integration.
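One selection step of this procedure can be sketched as below. The item parameters are toy values; the point is that Fisher information peaks for queries whose difficulty \(b_j\) lies near the current ability estimate \(\hat{\theta}\), so those are the most informative to evaluate next.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fisher_info(theta, a, b):
    """I(theta, a, b) = a^2 * p * (1 - p) with p = sigmoid(a*(theta - b))."""
    p = sigmoid(a * (theta - b))
    return a * a * p * (1.0 - p)

def select_next(theta_hat, items, asked):
    """Pick the not-yet-asked item maximizing Fisher information at theta_hat."""
    return max((j for j in range(len(items)) if j not in asked),
               key=lambda j: fisher_info(theta_hat, *items[j]))

# Items are (a, b) pairs; values made up. With theta_hat = 0, the query with
# difficulty closest to 0 carries the most information.
items = [(1.0, -2.0), (1.0, 0.1), (1.0, 3.0)]
next_item = select_next(theta_hat=0.0, items=items, asked=set())
```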

Loss & Training

The IRT model is trained with a binary cross-entropy loss:

\[\mathcal{L}_{2PL} = -\frac{1}{nk} \sum_{i=1}^n \sum_{j=1}^k \left[ y_{ij} \log p_{ij} + (1-y_{ij}) \log(1-p_{ij}) \right]\]

where \(y_{ij} \in \{0,1\}\) indicates whether configuration \(g_i\) correctly answers query \(q_j\). In total, 1.75 million binary response records were collected, covering 35 configurations and 50,139 queries.
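The loss above is ordinary BCE over an \(n \times k\) response matrix. A minimal sketch with toy values (two configurations, two queries; all parameters made up rather than fitted):

```python
import numpy as np

def bce_loss(p, y, eps=1e-9):
    """Mean binary cross-entropy over the full response matrix."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

theta = np.array([[-0.5], [1.0]])           # n=2 configuration abilities
a = np.array([1.0, 1.5])                    # k=2 query discriminations
b = np.array([0.0, 0.8])                    # k=2 query difficulties
p = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # 2PL probabilities, shape (n, k)
y = np.array([[0, 0], [1, 1]])              # toy binary correctness records

loss = bce_loss(p, y)
```

In the paper this sum runs over the full collection of 1.75 million (configuration, query) response records.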

Key Experimental Results

Main Results (In-Distribution Setting, Hypervolume Metric, Higher is Better)

| Benchmark | Random-Pair | RouterBench | IRT-Router | RADAR (Ours) | Gain vs. Runner-Up |
| --- | --- | --- | --- | --- | --- |
| GPQA-Diamond | 0.5545 | 0.6866 | 0.6942 | 0.7513 | +8% |
| MMLU | 0.6905 | 0.8592 | 0.8604 | 0.8720 | +1.3% |
| MMLU-Redux | 0.7281 | 0.9053 | 0.9117 | 0.9230 | +1.2% |
| LSAT | 0.6913 | 0.9125 | 0.9163 | 0.9188 | +0.3% |
| FRAMES | 0.6589 | 0.8325 | 0.8501 | 0.8762 | +3.1% |
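The hypervolume numbers above measure the area of the performance-cost region dominated by a router's trade-off curve. A minimal 2-D sketch with toy points, assuming costs and performances normalized to \([0, 1]\) and a reference point at worst cost / worst performance:

```python
def hypervolume_2d(points, ref=(1.0, 0.0)):
    """Hypervolume of (cost, perf) points: minimize cost, maximize perf.

    Sweep points by increasing cost; each point that raises the running best
    performance adds a rectangle of dominated area relative to `ref`.
    """
    hv, prev_perf = 0.0, ref[1]
    for cost, perf in sorted(points):
        if perf > prev_perf:
            hv += (ref[0] - cost) * (perf - prev_perf)
            prev_perf = perf
    return hv

# Toy trade-off front: cheap/weak, mid, expensive/strong configurations.
front = [(0.1, 0.5), (0.4, 0.8), (0.9, 0.9)]
hv = hypervolume_2d(front)  # area dominated by the front
```

A router whose achievable (cost, performance) points push further toward the cheap/accurate corner dominates more area and scores a higher hypervolume.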

Ablation Study

| Configuration | Hypervolume | Note |
| --- | --- | --- |
| Linear scalarization (ID) | Marginally better | Slight advantage in the in-distribution setting |
| Chebyshev scalarization (OOD) | Better | Clear advantage in the out-of-distribution setting |
| 20% training data | Comparable | Similar performance with only 20% of the data |
| RADAR (35 configs) | Baseline | Original 35 configurations |
| RADAR++ (43 configs) | Improved | Gains after adding Qwen3-14B via adaptive testing |

Key Findings

  • On MATH-500, RADAR achieves 90% of o4-mini (high budget) performance at only 1.31% of its cost.
  • On FRAMES (long-context multi-document QA), RADAR achieves 90% performance at 10% of the cost, whereas the runner-up requires 30%.
  • RADAR's routing latency is approximately 7ms, negligible compared to the ~870ms generation time of the smallest RLM configuration.
  • Adaptive testing requires only 12% of the training set (~5k queries) to accurately estimate the ability of a new configuration.
  • Estimated query difficulty shows moderate Pearson correlation (0.509) with five-level human-annotated difficulty labels on MATH-500.

Highlights & Insights

  • First to introduce MOO beyond linear scalarization for LLM routing: Chebyshev scalarization recovers non-convex regions of the Pareto frontier.
  • Psychometrics-inspired IRT model: Treating queries as exam items and model configurations as examinees yields a natural and interpretable framework.
  • Extreme cost savings: Achieving 90% performance at 1.31% cost on MATH-500 is a compelling result.
  • Plug-and-play design: No fine-tuning of RLMs is required; new models can be integrated quickly in a black-box manner.
  • Strong OOD generalization: Generalization to long-context multi-document QA tasks is particularly notable.

Limitations & Future Work

  • Cost prediction relies on a simple heuristic (average token count × unit price) and does not account for query-specific cost variation.
  • Generalization to highly difficult OOD benchmarks such as AIME is limited, with the router tending to assign lower-ability configurations.
  • Only text modality is considered; extension to multimodal reasoning scenarios remains unexplored.
  • The linear parameterization of 2PL IRT may be insufficient to capture complex difficulty–ability interactions.
  • The setting of aggregate budget constraints across batched queries is not addressed.

Related Work & Context

  • IRT-Router (Song et al., 2025): Uses multidimensional IRT (MIRT) with more parameters but non-scalar abilities; RADAR achieves interpretable ordering with scalar ability values.
  • RouterBench (Hu et al., 2024): Conventional model routing; this work extends it to RLM configuration-level routing.
  • Efficient inference methods (L1/S1, etc.): Complementary to RADAR and can be incorporated as additional configurations in the routing pool.
  • Adaptive testing in educational assessment: The Fisher information-based item selection strategy is successfully adapted from this domain.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of IRT and MOO is novel, though individual components are not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 8 benchmarks, 35 configurations, and 1.75 million data points — comprehensive and rigorous.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, complete mathematical derivations, and intuitive figures.
  • Value: ⭐⭐⭐⭐⭐ Directly addresses a core challenge in practical RLM deployment with substantial cost savings.