IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory¶

Conference: ACL 2025
arXiv: 2506.01048
Code: https://github.com/Mercidaiha/IRT-Router
Area: Interpretability
Keywords: LLM routing, item response theory, multi-model selection, interpretability, cost optimization

TL;DR¶

IRT-Router borrows Item Response Theory (IRT) from psychometrics, treating LLMs as "test-takers" and queries as "exam questions." It learns a multi-dimensional ability vector for each LLM, together with difficulty and discrimination parameters for each query, which makes multi-LLM routing interpretable. It reaches over 87% accuracy in OOD scenarios at only 1/30 of the cost of GPT-4o.

Background & Motivation¶

Background¶

Background: When using multiple LLMs, it is necessary to automatically select the most appropriate model based on query characteristics to balance performance and cost.

Limitations of Prior Work: Existing routing methods (RouteLLM, RouterBench) rely on simple heuristics or black-box classifiers. They lack interpretability and cannot explain why a query is routed to a specific model.

Key Challenge: A router must address three problems at once: interpretability, cold-start (how to route queries it has never seen), and the performance-cost trade-off.

Core Idea: IRT naturally models the "ability-difficulty" relationship, and transferring it to LLM routing yields both interpretability and effectiveness.

Method¶

Overall Architecture¶

The method comes in two variants: (1) MIRT-Router (Multi-dimensional IRT) predicts the probability that LLM $M_j$ answers query $q_i$ correctly as $\hat{P}(q_i, M_j) = 1/(1 + \exp(-a_i^\top \theta_{M_j} + b_i))$, where $\theta_{M_j}$ is the LLM ability vector, $a_i$ is the query discrimination vector, and $b_i$ is the query difficulty; (2) NIRT-Router (Neural IRT) introduces relevance vectors and a neural network interaction function.
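
The MIRT-Router formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `mirt_probability` and the three-dimensional parameter values are hypothetical, chosen only to show how a larger ability vector raises the predicted probability for the same query.

```python
import math

def mirt_probability(a_i, theta_j, b_i):
    """P-hat(q_i, M_j) = 1 / (1 + exp(-a_i^T theta_j + b_i))."""
    # Inner product of query discrimination and LLM ability, minus difficulty.
    logit = sum(a * t for a, t in zip(a_i, theta_j)) - b_i
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical parameters (illustration only, not learned values).
a_i = [1.0, 0.5, 0.0]           # discrimination of query q_i
b_i = 1.5                       # difficulty of query q_i
theta_strong = [2.0, 1.0, 0.5]  # ability vector of a stronger LLM
theta_weak = [0.5, 0.0, 0.5]    # ability vector of a weaker LLM

p_strong = mirt_probability(a_i, theta_strong, b_i)
p_weak = mirt_probability(a_i, theta_weak, b_i)
print(f"strong: {p_strong:.3f}, weak: {p_weak:.3f}")
```

Because $b_i$ enters the exponent with a positive sign, a harder query lowers the predicted probability for every LLM, while $a_i$ controls how strongly each ability dimension separates the models.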

Key Designs¶

IRT Modeling: Each LLM is assigned a multi-dimensional ability vector $\theta_{M_j}$, and each query has a difficulty $b_i$ and a discrimination $a_i$. The parameters are learned via embedding + linear transformation.
Cold-start Warm-up: For unseen queries, the query embedding is interpolated with the mean embedding of its neighboring known queries: $e_{q_i}' = (1-\lambda) e_{q_i} + \lambda \cdot \text{mean(neighbors)}$, with $\lambda$ between 0.3 and 0.4 working best.
Scoring Function: $S(q_i, M_j) = \alpha \hat{P}(q_i, M_j) - \beta C(M_j)$, where $C(M_j)$ is the cost of model $M_j$ and the weights satisfy $\alpha+\beta=1$, trading off performance against cost.
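
The warm-up and scoring steps above can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's code: the helper names `warm_up` and `route`, the neighbor count `k`, the toy embeddings, and the costs normalized to $[0, 1]$ are all hypothetical. Euclidean distance is used for neighbor search, matching the proximity metric mentioned under Limitations.

```python
import math

def warm_up(e_q, known_embeddings, k=2, lam=0.3):
    """e' = (1 - lam) * e_q + lam * mean(k nearest known embeddings)."""
    # k is an assumed hyperparameter; neighbors are ranked by Euclidean distance.
    nearest = sorted(known_embeddings, key=lambda e: math.dist(e_q, e))[:k]
    mean = [sum(col) / len(nearest) for col in zip(*nearest)]
    return [(1 - lam) * x + lam * m for x, m in zip(e_q, mean)]

def route(p_hat, cost, alpha=0.8):
    """Return the model maximizing S = alpha * P-hat - beta * C, beta = 1 - alpha."""
    beta = 1.0 - alpha
    scores = {m: alpha * p_hat[m] - beta * cost[m] for m in p_hat}
    return max(scores, key=scores.get), scores

# Toy 2-D embeddings (hypothetical): the far-away point is ignored when k=2.
warmed = warm_up([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])

# Hypothetical predicted probabilities and costs normalized to [0, 1].
p_hat = {"large": 0.90, "small": 0.80}
cost = {"large": 1.00, "small": 0.05}

cost_aware_choice, _ = route(p_hat, cost, alpha=0.8)
accuracy_only_choice, _ = route(p_hat, cost, alpha=1.0)
print(warmed, cost_aware_choice, accuracy_only_choice)
```

With these toy numbers, setting $\alpha = 1$ ignores cost and selects the model with the highest predicted probability, while $\alpha = 0.8$ shifts the choice to the cheaper model whose predicted probability is only slightly lower.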

Key Experimental Results¶

Main Results¶

Method	Accuracy	Cost	Reward
MIRT-Router	80.67%	$0.42	63.89
RouterBench	80.01%	$1.15	62.23
RouteLLM	77.25%	$12.80	42.00
GPT-4o only	77.53%	$12.93	42.02

OOD Scenarios (20 candidate LLMs, 12 datasets)¶

Method	Accuracy	Cost
MIRT-Router	87.12%	$0.14
NIRT-Router	87.37%	$0.15

Top-k Routing Accuracy (MIRT-Router)¶

Scenario	Top-1	Top-2	Top-3	Top-5
ID	2.72%	9.88%	32.51%	47.85%
OOD	2.15%	7.50%	27.29%	39.47%

Top-1 accuracy is low because several LLMs have similar abilities (e.g., both DeepSeek-Chat and DeepSeek-Coder reach 81%), so the single best model for a query is hard to pin down; Top-3 accuracy in the ID scenario reaches 32.5%. Routing analysis shows that 80% of high-difficulty queries are routed to DeepSeek-Chat, while 99% of low-difficulty queries are routed to the most cost-effective model, Qwen2.5-32B-GPTQ ($0.2/M).

Key Findings¶

Optimal Performance-Cost Trade-off: MIRT-Router's accuracy is about 3 percentage points higher than using GPT-4o alone (80.67% vs. 77.53%), at roughly 1/30 of the cost ($0.42 vs. $12.93).
Interpretability: The ability vectors and difficulty scores possess clear semantics (e.g., DeepSeek-Chat achieves the highest ability of 81%, while GPT-4o achieves 78%).
Effective Cold-start: The warm-up mechanism significantly boosts OOD performance, with a more substantial impact on NIRT-Router.
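
The trade-off figures in the first finding can be checked directly against the Main Results table. The snippet below is only a sanity check on the reported numbers; the variable names are illustrative.

```python
# Values copied from the Main Results table.
mirt_acc, mirt_cost = 80.67, 0.42      # MIRT-Router
gpt4o_acc, gpt4o_cost = 77.53, 12.93   # GPT-4o only

acc_gap = mirt_acc - gpt4o_acc         # difference in percentage points
cost_ratio = gpt4o_cost / mirt_cost    # how many times cheaper MIRT-Router is

print(f"accuracy gap: {acc_gap:.2f} points, cost ratio: {cost_ratio:.1f}x")
```

The gap comes out at a little over 3 percentage points and the cost ratio at just under 31, which is consistent with the "3% higher at 1/30 of the cost" summary.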

Highlights & Insights¶

Elegant cross-domain transfer from IRT to LLM routing: The mature theory of psychometrics is directly applied to LLM capability evaluation.
Interpretability is the core selling point: Beyond strong routing performance, it provides explanations for what each LLM specializes in and where the difficulty lies in each query.

Limitations & Future Work¶

Top-1 routing accuracy is relatively low (2-3%) because multiple models share similar capabilities, resulting in close proximity of ability vectors in high-dimensional space.
Generalization to LLMs that were entirely unseen during training is limited; a small amount of calibration data is required to initialize the ability vector of a new model.
The router is insufficiently sensitive to changes in the cost parameters; adjusting the ratio $\alpha/\beta$ has limited influence on routing decisions.
The proximity metric for cold-start warm-up relies on Euclidean distance within the embedding space, which may not accurately reflect the similarity in true query difficulty.
The IRT model assumes static capabilities and difficulties, yet LLM capabilities can change as API versions update, necessitating periodic recalibration.
Although the NIRT version introduces neural networks to enhance flexibility, it sacrifices the complete interpretability of the MIRT version.

Rating¶

Novelty: ⭐⭐⭐⭐ The application of IRT to LLM routing represents an ingenious cross-domain transfer.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 20 LLMs × 12 datasets × ID+OOD scenarios.
Writing Quality: ⭐⭐⭐⭐ Clear presentation of IRT theory combined with thorough interpretability analysis.
Value: ⭐⭐⭐⭐⭐ Directly practical for multi-LLM deployment scenarios.